Pathogenicity language model

ABSTRACT

A system comprises chunking logic that chunks (or splits) a multiple sequence alignment (MSA) into chunks, first attention logic that attends to a representation of the chunks and produces a first attention output, first aggregation logic that produces a first aggregated output that contains those features in the first attention output that correspond to masked residues in the plurality of masked residues, mask revelation logic that produces an informed output based on the first aggregated output and a Boolean mask, second attention logic that attends to the informed output and produces a second attention output based on masked residues revealed by the Boolean mask, second aggregation logic that produces a second aggregated output that contains those features in the second attention output that correspond to masked residues concealed by the Boolean mask, and output logic that produces identifications of the masked residues based on the second aggregated output.

PRIORITY APPLICATIONS

This application claims the benefit of and priority to the following:

U.S. Provisional Pat. Application No.: 63/294,813, titled “PERIODIC MASK PATTERN FOR REVELATION LANGUAGE MODELS,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1063-1/IPp-2296-PRV);

U.S. Provisional Pat. Application No.: 63/294,816, titled “CLASSIFYING MILLIONS OF VARIANTS OF UNCERTAIN SIGNIFICANCE USING PRIMATE SEQUENCING AND DEEP LEARNING,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1064-1/IP-2297-PRV);

U.S. Provisional Pat. Application No.: 63/294,820, titled “IDENTIFYING GENES WITH DIFFERENTIAL SELECTIVE CONSTRAINT BETWEEN HUMANS AND NON-HUMAN PRIMATES,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1065-1/IP-2298-PRV);

U.S. Provisional Pat. Application No.: 63/294,827, titled “DEEP LEARNING NETWORK FOR EVOLUTIONARY CONSERVATION,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1066-1/IP-2299-PRV);

U.S. Provisional Pat. Application No.: 63/294,828, titled “INTER-MODEL PREDICTION SCORE RECALIBRATION,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1067-1/IP-2301-PRV); and

U.S. Provisional Pat. Application No.: 63/294,830, titled “SPECIES-DIFFERENTIABLE EVOLUTIONARY PROFILES,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1068-1/IP-2302-PRV).

The priority applications are incorporated by reference as if fully set forth herein.

FIELD OF THE TECHNOLOGY DISCLOSED

The technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the technology disclosed relates to using neural networks to analyze ordered data.

INCORPORATIONS

The following are incorporated by reference for all purposes as if fully set forth herein:

-   Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018);
-   Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535-548 (2019);
-   U.S. Pat. Application No. 17/975,536, titled “MASK PATTERN FOR PROTEIN LANGUAGE MODELS,” filed on Oct. 27, 2022 (Attorney Docket No. ILLM 1063-2/IP-2296-US1);
-   U.S. Pat. Application No. 62/573,144, titled “TRAINING A DEEP PATHOGENICITY CLASSIFIER USING LARGE-SCALE BENIGN TRAINING DATA,” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-1/-1611-PRV);
-   U.S. Pat. Application No. 62/573,149, titled “PATHOGENICITY CLASSIFIER BASED ON DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-2/IP-1612-PRV);
-   U.S. Pat. Application No. 62/573,153, titled “DEEP SEMI-SUPERVISED LEARNING THAT GENERATES LARGE-SCALE PATHOGENIC TRAINING DATA,” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-3/IP-1613-PRV);
-   U.S. Pat. Application No. 62/582,898, titled “PATHOGENICITY CLASSIFICATION OF GENOMIC DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed Nov. 7, 2017 (Attorney Docket No. ILLM 1000-4/IP-1618-PRV);
-   U.S. Pat. Application No. 16/160,903, titled “DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-5/IP-1611-US);
-   U.S. Pat. Application No. 16/160,986, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-6/IP-1612-US);
-   U.S. Pat. Application No. 16/160,968, titled “SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-7/IP-1613-US);
-   U.S. Pat. Application No. 16/160,978, titled “DEEP LEARNING-BASED SPLICE SITE CLASSIFICATION,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1001-4/IP-1680-US);
-   U.S. Pat. Application No. 16/407,149, titled “DEEP LEARNING-BASED TECHNIQUES FOR PRE-TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed May 8, 2019 (Attorney Docket No. ILLM 1010-1/-1734-US);
-   U.S. Pat. Application No. 17/232,056, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS TO PREDICT VARIANT PATHOGENICITY USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURES,” filed on Apr. 15, 2021 (Atty. Docket No. ILLM 1037-2/IP-2051-US);
-   U.S. Pat. Application No. 63/175,495, titled “MULTI-CHANNEL PROTEIN VOXELIZATION TO PREDICT VARIANT PATHOGENICITY USING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Apr. 15, 2021 (Atty. Docket No. ILLM 1047-1/P-2142-PRV);
-   U.S. Pat. Application No. 63/175,767, titled “EFFICIENT VOXELIZATION FOR DEEP LEARNING,” filed on Apr. 16, 2021 (Atty. Docket No. ILLM 1048-1/P-2143-PRV);
-   U.S. Pat. Application No. 17/468,411, titled “ARTIFICIAL INTELLIGENCE-BASED ANALYSIS OF PROTEIN THREE-DIMENSIONAL (3D) STRUCTURES,” filed on Sep. 7, 2021 (Atty. Docket No. ILLM 1037-3/IP-2051A-US);
-   U.S. Provisional Pat. Application No.: 63/253,122, titled “PROTEIN STRUCTURE-BASED PROTEIN LANGUAGE MODELS,” filed Oct. 6, 2021 (Attorney Docket No. ILLM 1050-1/P-2164-PRV);
-   U.S. Provisional Pat. Application No.: 63/281,579, titled “PREDICTING VARIANT PATHOGENICITY FROM EVOLUTIONARY CONSERVATION USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURE VOXELS,” filed Nov. 19, 2021 (Attorney Docket No. ILLM 1060-1/P-2270-PRV); and
-   U.S. Provisional Pat. Application No.: 63/281,592, titled “COMBINED AND TRANSFER LEARNING OF A VARIANT PATHOGENICITY PREDICTOR USING GAPED AND NON-GAPED PROTEIN SAMPLES,” filed Nov. 19, 2021 (Attorney Docket No. ILLM 1061-1/P-2271-PRV).

BACKGROUND

The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.

The explosion of available biological sequence data has led to multiple computational approaches that infer the proteins’ three-dimensional structure, biological function, fitness, and evolutionary history from sequence data. So-called protein language models, like the ones based on the Transformer architecture, have been trained on large ensembles of protein sequences by using the masked language modeling objective of filling in masked amino acids in a sequence, given the surrounding ones.

Protein language models capture long-range dependencies, learn rich representations of protein sequences, and can be employed for multiple tasks. For example, protein language models can predict structural contacts from single sequences in an unsupervised way.

Protein sequences can be classified into families of homologous proteins that descend from an ancestral protein and share a similar structure and function. Analyzing multiple sequence alignments (MSAs) of homologous proteins provides important information about functional and structural constraints. The statistics of MSA columns, representing amino-acid sites, identify functional residues that are conserved during evolution. Correlations of amino acid usage between the MSA columns contain important information about functional sectors and structural contacts.

Language models were initially developed for natural language processing and operate on a simple but powerful principle: they acquire linguistic understanding by learning to fill in missing words in a sentence, akin to a sentence completion task in standardized tests. Language models develop powerful reasoning capabilities by applying this principle across large text corpora. The Bidirectional Encoder Representations from Transformers (BERT) model instantiated this principle using Transformers, a class of neural networks in which attention is the primary component of the learning system. In a Transformer, each token in the input sentence can “attend” to all other tokens by exchanging activation patterns corresponding to the intermediate outputs of neurons in a neural network.

Protein language models like the MSA Transformer have been trained to perform inference from MSAs of evolutionarily related sequences. The MSA Transformer interleaves per-sequence (“row”) attention with per-site (“column”) attention to incorporate epistasis. Epistasis leads to a co-evolution of certain protein positions: the effect of a mutation at one site depends on the presence or absence of mutations at other sites, which in turn influences which mutations persist. Combinations of row attention heads in the MSA Transformer have led to state-of-the-art unsupervised structural contact predictions.

End-to-end deep learning approaches for variant effect predictions are applied to predict the pathogenicity of missense variants from protein sequence and sequence conservation data (See Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018), referred to herein as “PrimateAI”). PrimateAI uses deep neural networks trained on variants of known pathogenicity with data augmentation using cross-species information. In particular, PrimateAI uses sequences of wild-type and mutant proteins to compare the difference and decide the pathogenicity of mutations using the trained deep neural networks. Such an approach that utilizes the protein sequences for pathogenicity prediction is promising because it can avoid the circularity problem and overfitting to previous knowledge. Relative to the amount of data needed to train the deep neural networks effectively, the amount of clinical data available in ClinVar is small. To overcome this data scarcity, PrimateAI uses common human variants and variants from primates as benign data, while mutation rate-matched samples of unlabelled data, based on trinucleotide context, are used as unknown data.

An opportunity arises to use protein language models and MSAs for variant pathogenicity prediction. More accurate variant pathogenicity prediction may result.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings, in which:

FIG. 1 is a high-level diagram that shows various aspects of the technology disclosed, and, in particular, illustrates generating a masked MSA and processing the masked MSA through the disclosed PrimateAI language model to produce a phenotype prediction.

FIG. 2 shows one implementation of applying the disclosed periodically-spaced mask grid to an MSA and generating the disclosed partially-masked MSA.

FIG. 3 shows one implementation of one-hot tokens that are defined for the twenty residue one-hot vectors, the gap residue one-hot vector, and the mask one-hot vector.

FIG. 4 illustrates one implementation of channel embeddings that are defined for the twenty residue channel embedding sets, the gap channel embedding set, and the mask channel embedding set.

FIG. 5 shows cropping, padding, and masking of MSAs in accordance with various implementations of the technology disclosed.

FIG. 6 depicts one implementation of generating the disclosed MSA representation.

FIG. 7 illustrates an example architecture of the disclosed PrimateAI language model.

FIG. 8 shows details of the disclosed mask revelation.

FIG. 9 shows various components of the PrimateAI language model.

FIG. 10 shows one implementation of the disclosed revelation output head used by the disclosed PrimateAI language model.

FIG. 11 is a computer-implemented method of the logic flow of the PrimateAI language model, in accordance with one implementation of the technology disclosed.

FIG. 12 is a system that is configured to implement the PrimateAI language model, in accordance with one implementation of the technology disclosed.

FIG. 13 shows the performance evaluation of the language modelling part of the disclosed PrimateAI language model with other language models.

FIG. 14 depicts the Top-1 training accuracy of the disclosed PrimateAI language model.

FIG. 15 is a computer system that can be used for compilation and runtime execution of the disclosed PrimateAI language model.

FIG. 16 illustrates a comparison between UniRef50 HHblits MSAs and human HHblits MSAs.

FIG. 17 illustrates the training of the PrimateAI language model using LAMB optimizer with gradient pre-normalization.

DETAILED DESCRIPTION

The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.

The detailed description of various implementations will be better understood when read in conjunction with the appended drawings. To the extent that the figures illustrate diagrams of the functional blocks of the various implementations, the functional blocks are not necessarily indicative of the division between hardware circuitry. Thus, for example, one or more of the functional blocks (e.g., modules, processors, or memories) may be implemented in a single piece of hardware (e.g., a general purpose signal processor or a block of random access memory, hard disk, or the like) or multiple pieces of hardware. Similarly, the programs may be stand-alone programs, may be incorporated as subroutines in an operating system, may be functions in an installed software package, and the like. It should be understood that the various implementations are not limited to the arrangements and instrumentality shown in the drawings.

The processing engines and databases of the figures, designated as modules, can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some of the modules can also be implemented on different processors, computers, or servers, or spread among a number of different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel or in a different sequence than that shown in the figures without affecting the functions achieved. The modules in the figures can also be thought of as flowchart steps in a method. A module also need not necessarily have all its code disposed contiguously in memory; some parts of the code can be separated from other parts of the code with code from other modules or other functions disposed in between.

Introduction

The disclosed PrimateAI language model uses a masked language modeling objective for training on sequences. During training, residues at different positions in a sequence are replaced with a mask token and the PrimateAI language model is trained to predict the original residues at those positions.

Masked language modeling allows training on a large amount of unlabelled data. Fill-in-the-blank multiple sequence alignment (MSA) Transformers simultaneously classify multiple masked locations in MSAs during training. Higher numbers of mask locations can add more masked language modelling (MLM) gradients that inform optimization, thereby enabling a higher learning rate and faster training.

However, fill-in-the-blank pathogenicity prediction is fundamentally different from traditional MLM because classification at a mask location depends on predicted values of residues at other mask locations. The classification scores may often be the averages of conditional predictions over all possible combinations of residues at other mask locations.

The PrimateAI language model avoids this averaging by revealing masked tokens at other mask locations before making predictions. The PrimateAI language model achieves state-of-the-art clinical performance and denoising accuracy whilst requiring 50x less computation for training than previous MSA Transformers. Various aspects of the technology disclosed, discussed later, contribute to the 50x reduction in training compute. Examples of such aspects include the periodically-spaced mask grid, mask revelation, and the architecture of the PrimateAI language model.

The PrimateAI language model can be considered an MSA Transformer for fill-in-the-blank residue classification. In one implementation, the PrimateAI language model is trained end-to-end on MSAs of UniRef50 proteins to minimize an unsupervised MLM objective. The PrimateAI language model outputs classification scores for alternative and reference residues, which serve as inputs to the PrimateAI three-dimensional (3D) rank loss.

Phenotype Prediction

FIG. 1 is a high-level diagram 100 that shows various aspects of the technology disclosed, and, in particular, illustrates generating a masked MSA 140 and processing the masked MSA 140 through the disclosed PrimateAI language model (i.e., a phenotype predictor 150 or pathogenicity language model) to produce a phenotype prediction 160.

In one implementation, an MSA dataset 110 includes a multiple sequence alignment (MSA) 120 for each sequence in a UniRef50 database that is retrieved by searching a UniClust30 database. The MSA 120 is an alignment of multiple homologous protein sequences to a target protein. From the MSA 120, the degree of homology can be inferred and the evolutionary relationships among the sequences can be studied. Since real protein sequences are likely to have insertions, deletions, and substitutions, the sequences are aligned by minimizing a Levenshtein distance-like metric over all the sequences. In some implementations, heuristic alignment schemes are used. For example, tools like JackHMMER and HHblits can increase the number and diversity of sequences returned by iteratively performing the search and alignment steps.

It is difficult to incorporate nearby evolution due to mutational differences in creatures with a recent ancestor being significantly influenced by electromechanical susceptibilities of proteins to mutations. To avoid this, the MSAs used by the technology disclosed contain diverse proteins that align with the query sequence. Using diverse sequences from many species reduces the influence of electromechanical susceptibility on predictions as the differences are more highly determined by natural selection.

In some implementations, the MSA dataset 110 can contain twenty-six million MSAs that are created by using the protein homology detection software HHblits. In other implementations, an additional set of MSAs can be generated for 19,071 human proteins using HHblits. A person skilled in the art will appreciate that the technology disclosed can search, generate, and otherwise leverage (or use) any number of MSAs.

In some implementations, those UniRef50 MSAs can be excluded from the MSA dataset 110 whose query sequences carry rare amino acids, thereby retaining only those MSAs in MSA dataset 110 that contain the twenty most abundant residues. In other implementations, only those non-query sequences can be included in the MSAs that contain the twenty most common residues and gaps, which in turn represent deletions relative to the query sequence.

In some implementations, the MSAs that are provided as inputs to the PrimateAI language model can have a fixed size of 1024 sequences. Of the 1024 sequences, up to 1023 non-query sequences can be randomly sampled from the filtered sequences if the MSA depth is larger than 1024. If the MSA depth is less than 1024, the MSA can be padded with zeros to fill the input. The MSA depth refers to the number of protein sequences in the MSA. For example, the MSA transformer with a fixed input MSA depth of 1024 sequences can be trained. This simplifies processing by the model because the tensors input to the model have a fixed shape. If the full MSA depth is less than 1024, padding can be added to increase its size to 1024. If the full MSA depth is more than 1024, 1023 sequences can be randomly sampled from the full MSA. The one query sequence can be kept such that the resulting MSA has a depth of 1024 (1023 randomly sampled sequences and 1 query sequence).
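The fixed-depth preparation described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes the MSA is already an integer token array whose first row is the query sequence, and the function name `fix_msa_depth` and the use of token id 0 for padding are hypothetical choices made for the example.

```python
import numpy as np

MSA_DEPTH = 1024  # fixed input depth stated in the text
PAD_ID = 0        # assumption: zero padding is represented by token id 0


def fix_msa_depth(msa, depth=MSA_DEPTH, rng=None):
    """Return an array with exactly `depth` rows; row 0 is the query sequence."""
    rng = np.random.default_rng() if rng is None else rng
    msa = np.asarray(msa)
    query, others = msa[:1], msa[1:]
    if msa.shape[0] > depth:
        # Keep the query and sample depth - 1 non-query rows without replacement.
        keep = np.sort(rng.choice(others.shape[0], size=depth - 1, replace=False))
        return np.concatenate([query, others[keep]], axis=0)
    # Shallow (or exactly deep enough) MSA: append zero rows up to the fixed depth.
    pad_rows = np.full((depth - msa.shape[0], msa.shape[1]), PAD_ID, dtype=msa.dtype)
    return np.concatenate([msa, pad_rows], axis=0)
```

Either branch yields a tensor of shape (1024, r), which is what gives the model a fixed input shape.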

A masking logic 130 can apply one or more masks to the MSA 120 and generate a masked MSA 140. The masks can be arranged in a periodic manner, non-periodic manner, regular manner, or irregular manner. The masks are not limited to periodically-spaced masks or a regular grid or array of masks. The masks can be irregular in shape, can be straight or curved, and can be arranged in irregular, non-evenly spaced patterns. The masks are regular in shape when the distance between adjacent masks is fixed or the same. The masks are irregular in shape when the distance between adjacent masks varies.

The phenotype predictor 150 (e.g., the PrimateAI language model) can process the masked MSA 140 and generate the phenotype prediction 160. In one implementation, the phenotype prediction 160 outputs the identity of the masked residues in the masked MSA 140. In other implementations, the phenotype prediction 160 can be used for variant pathogenicity prediction, protein contact map generation, protein functionality prediction, and so on.

Note that portions of this Application refer to a protein as a “sequence,” “residue sequence,” “amino acid sequence,” and “chain of amino acids” interchangeably. Also, note that portions of this Application use “amino acids” and “residues” interchangeably. Further note that portions of this Application use “a set of periodically-spaced masks,” “periodically-spaced masks,” “mask grid,” “periodically-spaced mask grid,” “periodic mask pattern,” and “fixed mask pattern” interchangeably.

The sequences shown in the figures are protein sequences comprising amino acid residues. In other implementations, the sequences can instead comprise DNA, RNA, carbohydrates, lipids or any other straight or branched biopolymer.

Having described the technology disclosed at a high level using FIG. 1, the discussion now turns to the disclosed periodically-spaced mask grid, a particular implementation of the masking logic 130.

Periodically-Spaced Mask Grid

FIG. 2 shows one implementation of applying the disclosed periodically-spaced mask grid 210 to an MSA 220 and generating the disclosed partially-masked MSA 230.

The columns of the periodically-spaced mask grid 210 correspond to residue positions. The residue positions are also referred to herein as ordinal positions. For example, in FIG. 2, the periodically-spaced mask grid 210 has nine columns corresponding to nine residue positions (i.e., r = 9).

The periodically-spaced mask grid 210 has elements (or units or tokens) that are masks. In FIG. 2, such mask elements are depicted by boxes with black fill and a “?” symbol. The periodically-spaced mask grid 210 also has elements (or units or tokens) that are not masks. In FIG. 2, such non-mask elements are depicted by boxes with yellow fill.

The rows of the periodically-spaced mask grid 210 include elements that are masks and elements that are not masks. The rows of the periodically-spaced mask grid 210 are referred to herein as mask distributions. For example, in FIG. 2, there are five mask distributions 1-5 (i.e., m mask distributions, where m = 5).

Each mask distribution has k periodically-spaced masks. For example, in FIG. 2, mask distributions 1-4 each have three masks (i.e., k = 3), and mask distribution 5 has two masks (i.e., k = 2).

The k periodically-spaced masks in a mask distribution are at k ordinal positions that begin at varying offsets from a first residue position in the periodically-spaced mask grid 210. For example, in FIG. 2, the k periodically-spaced masks of the first mask distribution are located at the third, the sixth, and the ninth ordinal positions, and begin at an offset of two from the first residue position in the periodically-spaced mask grid 210. The k periodically-spaced masks of the second mask distribution are located at the first, the fourth, and the seventh ordinal positions, and begin at an offset of zero from the first residue position in the periodically-spaced mask grid 210. The k periodically-spaced masks of the third mask distribution are located at the second, the fifth, and the eighth ordinal positions, and begin at an offset of one from the first residue position in the periodically-spaced mask grid 210. The k periodically-spaced masks of the fourth mask distribution are located at the third, the sixth, and the ninth ordinal positions, and begin at an offset of two from the first residue position in the periodically-spaced mask grid 210. The k periodically-spaced masks of the fifth mask distribution are located at the fourth and the seventh ordinal positions, and begin at an offset of three from the first residue position in the periodically-spaced mask grid 210.
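The FIG. 2 example can be reproduced with a short sketch. The per-row offsets (2, 0, 1, 2, 3) and the nine residue positions come from the description above, and the spacing of three follows from the listed ordinal positions. The function name `periodic_mask_grid` and the Boolean-array representation are assumptions made for illustration only.

```python
import numpy as np


def periodic_mask_grid(offsets, num_positions, stride):
    """Boolean grid with one row per mask distribution; True marks a mask."""
    grid = np.zeros((len(offsets), num_positions), dtype=bool)
    for row, offset in enumerate(offsets):
        # Start at the row's offset and place a mask every `stride` positions.
        grid[row, offset::stride] = True
    return grid


# Offsets of the five mask distributions described for FIG. 2.
grid = periodic_mask_grid(offsets=[2, 0, 1, 2, 3], num_positions=9, stride=3)

# Convert zero-based column indices to the one-based ordinal positions of the text.
ordinal_positions = [(np.flatnonzero(row) + 1).tolist() for row in grid]
# ordinal_positions == [[3, 6, 9], [1, 4, 7], [2, 5, 8], [3, 6, 9], [4, 7]]
```

The fifth row holds only two masks because a third mask at that spacing would fall past the ninth column.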

Masks in the periodically-spaced mask grid 210 are periodic because the masks have regular spacing between them and repeat at regular intervals, i.e., the masks are regularly-spaced repeats. The masks in the periodically-spaced mask grid 210 are also periodic because the masks have an ordered pattern.

The masks in the periodically-spaced mask grid 210 can have a lattice pattern, a diagonal pattern, a hexagonal pattern, a diamond pattern, a rectangle pattern, a square pattern, a triangle pattern, a convex pattern, a concave pattern, and/or a polygonal pattern.

In one implementation, the k periodically-spaced masks of each of the mask distributions in the periodically-spaced mask grid 210 have a same stride (e.g., stride = 3 in FIG. 2). In another implementation, the k periodically-spaced masks across the mask distributions in the periodically-spaced mask grid 210 have a diagonal pattern. In other implementations, the stride can be any number, such as 16 or in a range of 8 to 64 or any number in or subrange of that range. As used herein, the term “stride” refers to the distance between adjacent masks.

In other implementations, the masks in the periodically-spaced mask grid 210 are quasi-periodic, such that the masks have an ordered pattern, but the masks do not recur at precisely regular intervals.

The discussion now turns to FIGS. 3 and 4 to discuss the details of how the masks are encoded for processing by the PrimateAI language model. After having described FIGS. 3 and 4, the discussion will return to FIG. 2 to discuss how the disclosed partially-masked MSA is generated.

Masks

A mask token defines the masks. The mask token is configured to conceal or replace the original residue in an MSA onto which the mask token is applied. The mask token is a special or auxiliary token in the sense that the mask token is different from the twenty residue tokens that are used to define the twenty naturally-occurring residues. The mask token is also different from the gap residue token that is used to define the gap residue. The gap residues are those residues whose identities are unresolved (or unknown), and therefore the gap residues are not reliably classified as any of the twenty-one known residues. The gap residues are encoded by the gap residue token.

The mask token can be defined by the same encoding logic that defines the twenty residue tokens and the gap residue token in a way that encodes the mask token as the twenty-second residue.

FIG. 3 shows one implementation of one-hot tokens 300 that are defined for: the twenty residue one-hot vectors 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, and 320; the gap residue one-hot vector 321; and the mask one-hot vector 322. The one-hot tokens 300 are encoded with a binary vector of twenty-two bits, with one of the bits being hot (i.e., 1) while the others are 0. In some implementations, a one-hot encoder (not depicted) generates the one-hot tokens 300.
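A twenty-two-bit one-hot encoding of this kind can be sketched as below. The ordering of the twenty residue letters, the symbols "-" for the gap and "?" for the mask, and the helper name `one_hot` are illustrative assumptions; the text fixes only the vector width and that the mask is encoded as the twenty-second token.

```python
import numpy as np

# Assumed vocabulary order: twenty residues, then the gap, then the mask.
RESIDUES = list("ACDEFGHIKLMNPQRSTVWY")
GAP, MASK = "-", "?"
VOCAB = RESIDUES + [GAP, MASK]
TOKEN_INDEX = {token: i for i, token in enumerate(VOCAB)}


def one_hot(tokens):
    """Encode a sequence of tokens as rows of a 22-bit one-hot matrix."""
    idx = np.array([TOKEN_INDEX[t] for t in tokens])
    # Selecting rows of the identity matrix sets exactly one bit per token.
    return np.eye(len(VOCAB), dtype=np.uint8)[idx]


encoded = one_hot("AC-?")  # shape (4, 22); the mask occupies the last bit
```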

FIG. 4 illustrates one implementation of channel embeddings 400 (or learned embeddings) that are defined for: the twenty residue channel embedding sets 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, and 420; the gap channel embedding set 421; and the mask channel embedding set 422. The channel embeddings 400 span the twenty-one known residues. The channel embedding set 421 spans the gap residues. The mask channel embedding set 422 spans the mask residues. The channel embeddings 400 are tensors that have a height dimension, a width dimension, and a depth dimension, and each set of channel embeddings can include N channel embeddings, where N is an integer like ninety-four. In some implementations, an embeddings generator (not depicted; e.g., a multi-layer perceptron) generates the channel embeddings 400.

In some implementations, the embeddings generator can be trained in conjunction with the PrimateAI language model to learn and generate the channel embeddings 400. During inference, a lookup table can store a mapping between the one-hot tokens 300 and the channel embeddings 400. The lookup table can be accessed during the inference to replace the residue tokens, the gap token, and the mask token with the corresponding channel embeddings.
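The inference-time lookup can be illustrated with a small sketch. As a simplification, each token's channel embedding set is treated here as a flat vector of N = 94 values, and the table is filled with random numbers as a stand-in for learned embeddings. The names `embedding_table` and `embed` are hypothetical.

```python
import numpy as np

NUM_TOKENS = 22    # twenty residues, the gap token, and the mask token
NUM_CHANNELS = 94  # N channel embeddings per token, as in the example above

# Stand-in for learned values: one row of channel embeddings per token.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(NUM_TOKENS, NUM_CHANNELS)).astype(np.float32)


def embed(token_ids):
    """Replace each integer token id with its row of channel embeddings."""
    return embedding_table[np.asarray(token_ids)]


token_ids = np.array([[0, 21], [20, 5]])  # a tiny 2 x 2 tokenized MSA
embedded = embed(token_ids)               # shape (2, 2, 94)

# Indexing the table is equivalent to multiplying one-hot tokens by it.
one_hot_tokens = np.eye(NUM_TOKENS, dtype=np.float32)[token_ids]
same = np.allclose(one_hot_tokens @ embedding_table, embedded)
```

The table lookup avoids materializing the one-hot vectors while giving the same result as the matrix product.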

In other implementations, the encoding of the mask token (e.g., one-hot or channel embeddings) can vary depending on a variety of factors. Examples include the location (i.e., residue position) of the mask, the residue-type on which the mask is applied, the sequence-type on which the mask is applied, the sequence number on which the mask is applied, and the species-type of the sequence on which the mask is applied.

In other implementations, the mask token can be encoded using otherschemes. Examples include quantitative or numerical data type,qualitative data type, discreet data type, continuous data type (withlower and upper bounds), integer data type (with lower and upperbounds), nominal data type, ordinal or ranked data type, categoricaldata type, interval data type, and ratio data type. For example, theencoding can be based on, or any combination thereof, multiple bits,real values between 0 and 1, continuous values such as floating pointnumbers, Red, Green, Blue (RGB) values between 0 and 256, hexadecimalvalues of CSS colors (e.g., #F0F8FF), categorical color values of CSScolors, respective values of other CSS property groups and properties,size of a particular dimension (e.g., height and width), a set ofdifferent values and data types, and others.

The discussion now returns to FIG. 2 to discuss how the disclosed partially-masked MSA is generated.

Partially-Masked MSA

The MSA 220 has p rows and r columns. The p rows correspond to p protein sequences. The r columns correspond to r residue positions (e.g., r = 16 in FIG. 2). The periodically-spaced mask grid 210 can have a different number of rows and columns (i.e., a different shape) than the MSA 220. In some implementations, the periodically-spaced mask grid 210 can have the same number of rows and columns (i.e., the same shape) as the MSA 220.

The periodically-spaced mask grid 210 can be applied 212 (or overlaid) anywhere on the MSA 220. For example, the periodically-spaced mask grid 210 can be applied such that the periodically-spaced mask grid 210 is centered at a particular column of the MSA 220 that contains a residue-of-interest 214 (in red) at a position-of-interest 216 (in red). In another example, the periodically-spaced mask grid 210 can be applied such that the periodically-spaced mask grid 210 is placed at a particular row (e.g., the query sequence like sequence one in FIG. 2) of the MSA 220 that contains the residue-of-interest 214 at the position-of-interest 216.

In one implementation, the periodically-spaced mask grid 210 is applied to a subset of sequences in the MSA 220, spanning a window of sequences 222 (e.g., five sequences in FIG. 2). In some implementations, the periodically-spaced mask grid 210 can be applied on the MSA 220 in a left-flanking manner or a right-flanking manner. In other implementations, the periodically-spaced mask grid 210 can be applied on the MSA 220 on a portion-by-portion basis, traversing portions (e.g., quadrants) of the MSA 220 simultaneously or sequentially.

Those residues of the MSA 220 onto which the non-mask elements of the periodically-spaced mask grid 210 are overlaid remain unchanged and are referred to herein as the unmasked residues. Conversely, those residues of the MSA 220 onto which the mask elements of the periodically-spaced mask grid 210 are overlaid change to the mask token and are referred to herein as the masked residues.
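As a non-limiting sketch, the overlay of a periodically-spaced mask grid onto the first rows of an MSA can be expressed as follows. The function name `apply_mask_grid`, the token spelling `<mask>`, the toy sequences, and the default rule that shifts the pattern by one column per row are assumptions made for illustration.

```python
MASK = "<mask>"


def apply_mask_grid(msa, n_masked_rows, stride, offsets=None):
    """Overlay a periodic mask on the first n_masked_rows rows of an MSA.

    Returns the partially-masked MSA and a Boolean grid marking mask sites.
    """
    if offsets is None:
        # Assumed offset rule: row i starts its pattern at column i mod stride.
        offsets = [i % stride for i in range(n_masked_rows)]
    masked, is_masked = [], []
    for i, row in enumerate(msa):
        if i < n_masked_rows:
            flags = [j >= offsets[i] and (j - offsets[i]) % stride == 0
                     for j in range(len(row))]
        else:
            flags = [False] * len(row)  # rows outside the window stay unmasked
        masked.append([MASK if f else tok for tok, f in zip(row, flags)])
        is_masked.append(flags)
    return masked, is_masked


msa = [list("MKTAYIAK"), list("MKSAYLAK"), list("MRTA-IAK")]
masked_msa, mask_flags = apply_mask_grid(msa, n_masked_rows=2, stride=3)
```

In this toy example the first row is masked at columns 0, 3, and 6, the second row at columns 1, 4, and 7, and the third row is left unchanged.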

A combination or aggregation of the unmasked residues and the masked residues forms the partially-masked MSA 230. The partially-masked MSA 230 can be defined as an MSA that includes some residues that are not masked (unmasked) and some residues that are masked. The partially-masked MSA 230 can also be defined as an MSA that includes some sequences that contain masked residues and some sequences that do not contain any masked residues.

A portion (or patch) of the partially-masked MSA 230 can be cropped (or selected or extracted) to generate a cropped portion 232 (in blue, dashed outline in FIG. 2). In some implementations, the cropped portion 232 can include: (i) the masked residues in the window of sequences 222, (ii) some unmasked residues that are contiguously adjacent to the masked residues within a neighborhood that coincides with (or defines) a boundary of the cropped portion 232, and (iii) portions of some additional sequences that extend beyond the window of sequences 222 and do not contain any masked residues.

MSA Cropping, Padding, and Masking

FIG. 5 shows cropping, padding, and masking of MSAs 500 in accordance with various implementations of the technology disclosed. In FIG. 5, a residue-of-interest at a position-of-interest in the query sequence is indicated by an X, mask locations are indicated by black fill, padding is indicated by grey fill, and crop regions are indicated by red, dashed lines. In these examples, the mask stride is three and the cropping window width is six residues.

In panel A, away from the MSA edges, the position-of-interest is at the right side of the center of a crop region. In panel B, a crop region is shifted to the right of the position-of-interest to avoid going over an MSA edge. In panel C, an MSA for a short protein is padded to fill a crop region. In panel D, a crop region is shifted to the right of the position-of-interest to minimize padding and the MSA is padded to fill the crop region.

In some implementations, the position-of-interest is randomly sampled from positions in the query sequence during training or chosen by a user during inference. To maximize information about the position-of-interest, in some implementations, a cropping window is selected with a size of 256 residues such that the position-of-interest is at the center. However, the cropping window can be shifted if the position-of-interest is near the edge of an MSA to avoid padding zeros and to increase information about the position-of-interest. If the query sequence is shorter than the cropping window, zeros can be padded to fill the window size.
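The window selection described above can be sketched, under stated assumptions, as a small helper. The name `crop_window`, the half-open interval convention, and the choice of placing the position-of-interest at index `window // 2` of the crop are illustrative assumptions and are not prescribed by this disclosure.

```python
def crop_window(position, msa_length, window=256):
    """Return (start, end, pad) for a half-open crop [start, end).

    pad is the number of zero-padded columns needed to fill the window.
    """
    if msa_length <= window:
        # Short protein: keep every column and pad the remainder.
        return 0, msa_length, window - msa_length
    # Center the window on the position-of-interest (assumed convention).
    start = position - window // 2
    # Shift the window back inside the alignment if it overhangs an edge.
    start = max(0, min(start, msa_length - window))
    return start, start + window, 0
```

For example, under these assumptions `crop_window(500, 1000)` yields a centered window, `crop_window(10, 1000)` and `crop_window(990, 1000)` yield windows shifted to stay inside the MSA, and `crop_window(50, 100)` keeps all 100 columns and reports 156 columns of padding.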

In some implementations, a smaller probability, p_(sample), is assigned to an MSA being sampled during training if the length, L, of the query sequence is short, for example,

$p_{\text{sample}} \propto \frac{\max( {\min( {L,512} ),64} )}{512}.$

This assignment rebalances the distribution of lengths for UniRef50 proteins used for training and for human proteins, and also prevents wastage of computation on padding.
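As a worked illustration of the proportionality above, the unnormalized sampling weight can be computed as follows. The function name `sample_weight` is hypothetical, and the weights would still need to be normalized over the training set to obtain probabilities.

```python
def sample_weight(length, cap=512, floor=64):
    """Unnormalized p_sample: max(min(L, 512), 64) / 512."""
    return max(min(length, cap), floor) / cap


# A 30-residue protein is sampled one eighth as often as a protein of
# 512 or more residues; a 256-residue protein is sampled half as often.
weights = [sample_weight(n) for n in (30, 256, 2000)]
```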

The UniRef50 proteins used for training often have short sequences, whereas a majority of human proteins have long sequences. FIG. 16 illustrates a comparison between UniRef50 HHblits MSAs and human HHblits MSAs. Many of the proteins in the UniRef50 HHblits MSAs have a short sequence, while only a few human proteins among the MSAs are short. Accordingly, the sampling of longer UniRef50 proteins during training can be increased, such that the sampled distribution of short and long proteins is closer to the distribution of human proteins. Increasing the sampling of long-sequence UniRef50 proteins also increases computation efficiency. When only short-sequence UniRef50 proteins are used as input, the input is padded up to a fixed input shape, which means that computation during the training process is wasted on padding rather than on contributing gradients to the model optimization.

The probability of sampling non-query sequences to be included in the first f sequences of an MSA can also be adjusted (e.g., f = 32). In one implementation, the periodically-spaced mask grid 210 is applied in a way that penalizes the occurrences of gaps in the first f sequences. The probability, p_(mask), of a non-query sequence being masked decreases with an increasing number of gap tokens, N_(gap), for example,

$p_{\text{mask}} \propto \frac{(L - N_{\text{gap}})^{2}}{L^{2}}.$

Downsampling of sequences with a considerable number of gaps reduces the fraction of missing data in the MSAs.
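For illustration, the gap penalty can be evaluated per sequence as sketched below. The function name `mask_weight` and the use of "-" as the gap character are assumptions; the returned value is an unnormalized weight rather than a probability.

```python
def mask_weight(sequence, gap="-"):
    """Unnormalized p_mask: (L - N_gap)^2 / L^2 for one aligned sequence."""
    length = len(sequence)
    if length == 0:
        return 0.0
    n_gap = sequence.count(gap)
    return (length - n_gap) ** 2 / length ** 2
```

A sequence in which half of the aligned positions are gaps therefore receives one quarter of the weight of a gap-free sequence.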

MSA Representation

FIG. 6 depicts one implementation of generating 600 the disclosed MSA representation. Panel A shows the MSA 220. Panel B shows the partially-masked MSA 230. In this example, the periodically-spaced mask grid 210 is applied to the first four sequences of the MSA 220 and has a stride of three. The partially-masked MSA 230 is generated as a result of applying the periodically-spaced mask grid 210 to the MSA 220. In panel C, the unmasked residues and the masked residues in the partially-masked MSA 230 are replaced with corresponding ones of the channel embeddings 400. In one implementation, the corresponding ones of the channel embeddings 400 are summed with position embeddings for residue columns. The position embeddings can be learned and generated during the training of the PrimateAI language model. The sum of the corresponding ones of the channel embeddings 400 and the position embeddings is divided into chunks 640. In panel D, the chunks 640 are concatenated in the channel dimension into a stack 660 and then linearly projected 670 to form an MSA representation 680. In some implementations, the linear projection 670 uses a plurality of one-dimensional (1D) convolution filters.

The channel embeddings 400 are also referred to herein as learned embeddings. In one implementation, the masked residues and the unmasked residues in the partially-masked MSA 230 are translated into the learned embeddings by using a look-up table that stores learned embeddings corresponding to the masked residues and the unmasked residues.

The position embeddings are also referred to herein as residue position embeddings. The sum of the corresponding ones of the channel embeddings 400 and the position embeddings is also referred to herein as an embedded representation of the partially-masked MSA 230. The learned embeddings are concatenated with the residue position embeddings to generate the embedded representation.

The embedded representation is chunked into the series of chunks 640. The chunks in the series of chunks are concatenated into the stack 660.

The MSA representation 680 is also referred to herein as a projected (or compressed) representation of the embedded representation. The projected representation has m rows and r columns. The stack 660 is translated into the projected representation by using convolution operations, in accordance with one implementation. Note that the projected representation is not compressed at this stage in the making-data-smaller sense. The projected representation is “compressed” or “smaller” in comparison to what the embedded representation would be if rows were not stacked, which is why row stacking lowers computational requirements. However, the projected representation is not smaller than the model input in terms of feature dimensionality.

In one implementation, the fixed mask pattern is applied to the first thirty-two sequences of MSAs. The MSA tokens are encoded by learned 96-channel embeddings, which are summed with learned 96-channel position embeddings for residue columns before layer normalization. To reduce computational requirements, embeddings for the 1024 sequences in MSAs are split into thirty-two chunks, each containing thirty-two sequences, at periodic intervals along the sequence axis. These chunks are then concatenated in the channel dimension and mixed by linear projection. In the context of this application, chunks can be referred to as different non-overlapping rows of the MSA. In other implementations, the MSA can be “chunked” in other ways, such as column-wise, or in some other irregular pattern.
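The chunking, channel-wise concatenation, and linear mixing described above can be sketched on toy dimensions as follows. The function names `chunk_and_stack` and `linear_project`, the nested-list layout, and the choice of contiguous chunks (chunk c holding consecutive rows) are assumptions made for this sketch; other row-to-chunk assignments are equally compatible with the description.

```python
def chunk_and_stack(embedded, n_chunks):
    """Split rows into n_chunks chunks and concatenate them along channels.

    embedded[s][r] is the channel vector of sequence s at residue column r.
    The result has len(embedded) // n_chunks rows, the same number of
    columns, and n_chunks times as many channels.
    """
    n_seq = len(embedded)
    assert n_seq % n_chunks == 0, "rows must divide evenly into chunks"
    rows_per_chunk = n_seq // n_chunks
    # Assumed layout: chunk c holds the contiguous rows of its block.
    chunks = [embedded[c * rows_per_chunk:(c + 1) * rows_per_chunk]
              for c in range(n_chunks)]
    stacked = []
    for m in range(rows_per_chunk):
        row = []
        for r in range(len(embedded[0])):
            channels = []
            for chunk in chunks:
                channels.extend(chunk[m][r])
            row.append(channels)
        stacked.append(row)
    return stacked


def linear_project(stacked, weights):
    """Apply the same out-by-in weight matrix to every channel vector."""
    return [[[sum(w * x for w, x in zip(w_row, vec)) for w_row in weights]
             for vec in row] for row in stacked]


# Toy input: 4 sequences, 3 columns, 2 channels holding (sequence, column).
embedded = [[[s, r] for r in range(3)] for s in range(4)]
stacked = chunk_and_stack(embedded, n_chunks=2)
projected = linear_project(stacked, [[1, 0, 0, 0], [0, 0, 1, 0]])
```

With the dimensions given in the text, the same procedure would turn 1024 rows of 96 channels into 32 rows of 3072 channels before the projection mixes them.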

PrimateAI Language Model

FIG. 7 illustrates an example architecture 700 of the PrimateAI language model. The PrimateAI language model comprises a cascade of axial-attention blocks 710 (e.g., twelve axial-attention blocks). The cascade of axial-attention blocks 710 takes the MSA representation 680 as input and generates an updated MSA representation 720 as output. Each axial-attention block comprises residuals that add a tied row-wise gated self-attention layer 712, a tied column-wise gated self-attention layer 714, and a transition layer 716.

In one implementation, there are twelve heads in the tied row-wise gated self-attention layer 712. In one implementation, there are twelve heads in the tied column-wise gated self-attention layer 714. Each head generates sixty-four channels, totaling 768 channels across twelve heads. In one implementation, the transition layer 716 projects up to 3072 channels for GELU activation.

The technology disclosed modifies axial-gated self-attention to include tied attention instead of triangle attention. Triangle attention has a high computation cost. Tied attention is the sum of dot-product affinities, between queries and keys, across non-padding rows, followed by division by the square root of the number of non-padding rows, which reduces computational burden substantially.

The discussion now turns to the disclosed mask revelation.

Mask Revelation

The mask revelation reveals unknown values at other mask locations after the cascade of axial-attention blocks 710. The mask revelation gathers features aligned with mask sites. For each masked residue in a row, the mask revelation reveals embedded target tokens at other masked locations in that row.

The mask revelation combines the updated 768-channel MSA representation 720 with 96-channel target token embeddings 690 at locations indicated by a Boolean mask 770, which labels positions of mask tokens. The Boolean mask 770, which is a fixed mask pattern with stride 16, is applied row-wise to gather features from the MSA representation and target token embedding at mask token locations.

Feature gathering reduces row length from 256 to 16, which drastically decreases the computational cost of attention blocks that follow mask revelation. For each location in each row of the gathered MSA representation, the row is concatenated with a corresponding row from the gathered target token embedding where that location is also masked in the target token embedding. The MSA representation and partially revealed target embedding are concatenated in the channel dimension and mixed by a linear projection.

After mask revelation 730, the now-informed MSA representation 740 is propagated through residual row-wise gated self-attention layers 750, 756 and a transition layer 754. The attention is only applied to features at mask locations, as residues are known for other positions from the MSA representation 680 provided as input to the PrimateAI language model. Thus, attention only needs to be applied at mask locations where there is new information from mask revelation.

After interpretation of the mask revelations by self-attention, a masked gather operation 760 collects features from the resulting MSA representation at positions where target token embeddings remained masked. The gathered MSA representation 772 is translated to predictions 790 for 21 candidates in the amino acid and gap token vocabulary by an output head 780. The output head 780 comprises a transition layer and a perceptron.

FIG. 8 shows details 800 of the disclosed mask revelation. Mask revelation provides more information during subsequent training, improving the accuracy of predicting each residue of interest.

The first step is to gather 804, 830, 862 all the tokens at the mask locations 802, 860 marked by the dots. The term gather is used here interchangeably with the term aggregate. This is done for tokens in the updated MSA representation 720, the periodically-spaced mask grid 210, and the embedded representation (embedding tokens) 690.

In FIG. 8, the dashed lines and colors show how an MSA tile 806 and an embedding tile 844 are selected. Feature gathering reduces row length from 256 to 16 (6 to 2 in FIG. 8), which drastically decreases the computational cost of attention blocks that follow mask revelation. Each of the gathered representations is tiled or replicated/cloned 808, 830, 866 by the number of masks in the rows. In the example shown in FIG. 8, there are two masks per row. Therefore, there are two tiles that are concatenated as clones 810 and 870 as a result of cloning 808 and 866, respectively.

Mask revelation 830 is the removal of all the masks in a tile except for those at a single position. The top tile of the gathered masks is masked at the first position-of-interest 834 and unmasked at all the other positions-of-interest 836. The second tile is masked at the second position-of-interest 838 and unmasked at all the other positions-of-interest 832. Mask revelation reveals other tokens in a row for each masked position in the row. In some implementations, positions are masked in the same way in both training and inference. This results in higher performance than changing to only masking the position-of-interest during inference. The position of the location of interest in the input is chosen to maximize input information because, for example, when the location of interest is centered at the mask, more of the flanking columns of the MSA are included in the input that is processed by the PrimateAI language model.

Next, the remaining masks after mask revelation 830 are applied 868 to the embedding tile 844 to produce cloned and masked embedding tiles 870. The cloned and masked embedding tiles 870 are concatenated 872 with the cloned MSA tiles 810 to generate concatenated tiles 873. The concatenated tiles 873 are linearly projected 874 to produce the informed MSA representation 740.
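The gathering, cloning, revelation, and concatenation steps above can be sketched for a single row as follows. The function names `reveal_masks` and `gather_concealed`, the nested-list layout, and the use of a dedicated `mask_embedding` vector to represent a position that stays concealed are assumptions made for illustration. The linear projection and the subsequent self-attention are omitted.

```python
def reveal_masks(msa_features, target_embeddings, mask_embedding, mask_columns):
    """Build one tile per masked position of a row.

    Tile t conceals the target at masked position t and reveals the targets
    at every other masked position. Each tile entry concatenates, in the
    channel dimension, the gathered MSA feature with the (revealed or
    concealed) target embedding.
    """
    gathered_feats = [msa_features[c] for c in mask_columns]
    gathered_targets = [target_embeddings[c] for c in mask_columns]
    k = len(mask_columns)
    tiles = []
    for t in range(k):
        tile = []
        for j in range(k):
            target = mask_embedding if j == t else gathered_targets[j]
            tile.append(gathered_feats[j] + target)
        tiles.append(tile)
    return tiles


def gather_concealed(tiles):
    """Collect, from each tile, the entry whose target stayed concealed."""
    return [tiles[t][t] for t in range(len(tiles))]


# Toy row of length 6 with masks at columns 1 and 4, as in the 6-to-2 example.
features = [[r] for r in range(6)]           # 1-channel MSA features
targets = [[10 + r, 0] for r in range(6)]    # 2-channel target embeddings
tiles = reveal_masks(features, targets, [-1, -1], mask_columns=[1, 4])
concealed = gather_concealed(tiles)
```

In this toy example each of the two tiles has two entries, and only the diagonal entries, one per tile, still carry the concealing vector; those are the entries that a later masked gather would pass to the output head.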

PrimateAI Language Model Components & Training

FIG. 9 shows various components 900 of the PrimateAI language model, in accordance with one implementation. The components can include tied row-wise gated self-attention, row-wise gated self-attention, and column-wise gated self-attention. The PrimateAI language model can also use tied attention. Axial-attention creates independent attention maps for each row and column of the input. Sequences in an MSA usually have similar three-dimensional structures. Direct coupling analysis exploits this fact to learn structural contact information. To leverage this shared structure, it is beneficial to tie the row attention maps between the sequences in the MSA. As an additional benefit, tied attention reduces the memory footprint of the row attentions.

In implementations involving recomputation, tied attention reduces the memory footprint of the row attentions from O(ML²) to O(L²). Let M be the number of rows, L be the number of residue columns, d be the hidden dimension, and Q_(m), K_(m) be the matrices of queries and keys for the m-th row of input. Tied row attention is defined, before softmax is applied, to be:

$\frac{\sum_{m = 1}^{M}{Q_{m}K_{m}^{T}}}{\text{λ}( \text{M, d} )}$

The final model uses square root normalization. In other implementations, the model can also use mean normalization. In standard scaled dot-product attention, the denominator λ(M, d) would be the normalization constant √d. For tied row attention, two normalization functions can be used to prevent attention weights from scaling linearly with the number of input sequences: λ(M, d) = M√d (mean normalization) and λ(M, d) = √(Md) (square root normalization).
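As a minimal sketch of tied row attention with square root normalization, the shared attention map can be computed as follows. The function name `tied_row_attention_weights` and the nested-list layout are assumptions; gating, multiple heads, and the exclusion of padding rows are omitted for brevity.

```python
import math


def tied_row_attention_weights(queries, keys):
    """Compute one attention map shared by all M rows.

    queries[m][i] and keys[m][j] are d-channel vectors for row m.
    Logits are summed over rows and divided by sqrt(M * d).
    """
    n_rows = len(queries)
    length = len(queries[0])
    dim = len(queries[0][0])
    scale = math.sqrt(n_rows * dim)  # square root normalization
    logits = [[sum(queries[m][i][c] * keys[m][j][c]
                   for m in range(n_rows) for c in range(dim)) / scale
               for j in range(length)] for i in range(length)]
    weights = []
    for row in logits:
        top = max(row)  # subtract the maximum for numerical stability
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return logits, weights
```

Because a single L-by-L map is stored instead of one map per row, the memory needed for the row attention map no longer grows with the number of rows.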

In FIG. 9 , dimensions are shown for sequences, s = 32, residues, r =256, attention heads, h = 12, and channels, c = 64 and c_(MSA) = 768.

In one implementation, the PrimateAI language model can be trained on four A100 graphics processing units (GPUs). Optimizer steps are for a batch size of 80 MSAs, which is split over four gradient aggregations to fit batches into 40 GB of A100 memory. The PrimateAI language model is trained with the LAMB optimizer using the following parameters: β_1=0.9, β_2=0.999, ε=10⁻⁶, and weight decay of 0.01. Gradients are pre-normalized by division by their global L2 norm before applying the LAMB optimizer. Training is regularized by dropout with probability 0.1, which is applied after activation and before residual connections.
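The gradient pre-normalization step can be illustrated, independently of any particular optimizer library, as a division of all gradients by one global L2 norm. The function name `prenormalize` and the representation of gradients as a list of flat lists are assumptions made for this sketch; the LAMB update itself is not shown.

```python
import math


def prenormalize(grads):
    """Divide every gradient entry by the L2 norm taken over all tensors."""
    norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if norm == 0.0:
        return [list(tensor) for tensor in grads]  # nothing to rescale
    return [[g / norm for g in tensor] for tensor in grads]


# Two parameter tensors with gradients 3 and 4 share a global norm of 5.
normalized = prenormalize([[3.0], [4.0]])
```

After this step the concatenated gradient has unit L2 norm, so the scale of the raw gradients no longer influences the size of the optimizer step.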

FIG. 17 illustrates the training of the PrimateAI language model using the LAMB optimizer with gradient pre-normalization. Residual blocks are started as identity operations, which speeds up convergence and enables training of the PrimateAI language model. “AdamW” refers to ADAM optimizer with weight decay, “ReZeRO” refers to Zero Redundancy Optimizer and “LR” refers to LAMB optimizer with gradient pre-normalization. See, Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, Yang You, Jing Li, Sashank Reddi, et al., International Conference on Learning Representations (ICLR) 2020. As illustrated, the LAMB optimizer with gradient pre-normalization shows better performance (e.g., higher accuracy rate over fewer training iterations) and is more effective for a range of learning rates compared to the use of the AdamW optimizer and the Zero Redundancy Optimizer.

Axial dropout can be applied in self-attention blocks before residual connections. Post-softmax spatial gating in column-wise attention is followed by column-wise dropout, while post-softmax spatial gating in row-wise attention is followed by row-wise dropout. The post-softmax spatial gating allows for modulation on exponentially normalized scores or probabilities produced by the softmax.

In one implementation, the PrimateAI language model can be trained for 100,000 parameter updates. The learning rate is linearly increased over the first 5,000 steps from η=5×10⁻⁶ to a peak value of η=5×10⁻⁴, and then linearly decayed to η=10⁻⁴. Automatic mixed precision (AMP) can be applied to cast suitable operations from 32-bit to 16-bit precision during training and inference. This increases throughput and reduces memory consumption without affecting performance. In addition, a Zero Redundancy Optimizer reduces memory usage by sharding optimizer states across multiple GPUs.
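The learning rate schedule described above can be sketched as a piecewise-linear function of the update step. The function name `learning_rate` is hypothetical, and the assumption that the linear decay reaches its final value exactly at the last of the 100,000 updates is made for illustration only.

```python
def learning_rate(step, warmup=5_000, total=100_000,
                  start=5e-6, peak=5e-4, final=1e-4):
    """Linear warmup from start to peak, then linear decay to final."""
    if step < warmup:
        return start + (peak - start) * step / warmup
    # Assumed: decay spans the remaining updates and then holds at final.
    frac = min(1.0, (step - warmup) / (total - warmup))
    return peak + (final - peak) * frac
```

Under these assumptions the rate is 5×10⁻⁶ at step 0, reaches 5×10⁻⁴ at step 5,000, and falls to 10⁻⁴ at step 100,000.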

Revelation Output Head

FIG. 10 shows one implementation of the revelation output head 780 that can be used by the disclosed PrimateAI language model. The gathered MSA representation 772 can be translated by the output head 780 to predictions 790 for 21 candidates in an amino acid vocabulary including a gap token. In one implementation, an amino acid vocabulary can be enumerated and the amino acid enumerations are used to index a dictionary of learned embeddings. In other implementations, one-hot embeddings of amino acids can be used and combined with linear projections. In some implementations, the revelation output head 780 can comprise a transition layer 1002, a gate 1004, a layer normalization block 1006, a linear block 1008, a GELU block, and another linear block 1012. Dimensions are shown for channels, c_(MSA) = 768, and vocabulary size, v = 21.

Method

FIG. 11 is a computer-implemented method 1100 of the logic flow of the PrimateAI language model, in accordance with one implementation of the technology disclosed.

At action 1102, a multiple sequence alignment (MSA) 220 can be accessed. The MSA can have p rows and r columns. The p rows can correspond to p protein sequences. The r columns can correspond to r residue positions.

At action 1104, a mask grid 210 can be accessed. The mask grid 210 can have m mask distributions. Each of the m mask distributions can have k periodically-spaced masks at k ordinal positions that begin at varying offsets from a first residue position in the mask grid.

At action 1106, the m mask distributions can be applied to m protein sequences in the p protein sequences to generate a partially-masked MSA 230 that contains masked residues and unmasked residues, where p > m. In various implementations, p >= m.

At action 1108, the masked residues and the unmasked residues can be translated into learned embeddings 400, and the learned embeddings 400 can be concatenated with residue position embeddings to generate an embedded representation 690 of the partially-masked MSA 230.

At action 1110, the embedded representation 690 can be chunked (or split) into a series of chunks 640, chunks in the series of chunks 640 can be concatenated into a stack 660, and the stack 660 can be translated into a compressed representation 680 of the embedded representation 690. The compressed representation 680 can have m rows and r columns.

At action 1112, axial-attention 710 can be iteratively (or sequentially) applied across the m rows and the r columns of the compressed representation, and the applied attention can be interleaved (with transition layers) to generate an updated representation 720 of (or from) the compressed representation 680. The updated representation 720 can have m rows and r columns.

At action 1114, k updated representation tiles 810 can be aggregated from the updated representation 720. Each of the k updated representation tiles 810 can contain those updated representation features of the updated representation 720 that correspond to the masked residues. Each of the k updated representation tiles can have m rows and k columns. A given column in the k columns of a given updated representation tile 806 can contain a respective subset of the updated representation features. The respective subset can be located at a given ordinal position in the k ordinal positions. The given ordinal position can be represented by the given column.

At action 1116, k embedding tiles 870 corresponding to the k updated representation tiles 810 can be aggregated from the embedded representation 690. Each of the k embedding tiles 844 can contain those embedding features in a first chunk of the series of chunks that are translations of the masked residues. Each of the k embedding tiles can have m rows and k columns. A given column in the k columns of a given embedding tile can contain a respective subset of the embedding features. The respective subset can be located at a given ordinal position in the k ordinal positions. The given ordinal position can be represented by the given column.

At action 1118, k Boolean tiles 834, 838 can be applied to the k embedding tiles to generate k Booleaned (partially revealed) embedding tiles. Each of the k Boolean tiles can have m rows and k columns. Each of the k Boolean tiles can cause concealment of a corresponding one of the k columns in a corresponding one of the k embedding tiles, and can cause revelation of other ones of the k columns in the corresponding one of the k embedding tiles. Each of the k Booleaned embedding tiles can have m rows and k columns.

At action 1120, the k Booleaned (partially revealed) embedding tiles 870 can be concatenated with the k updated representation tiles 810 to generate k concatenated tiles 873, and the k concatenated tiles 873 can be translated into k compressed tile representations (informed MSA representation 740) of the k concatenated tiles 873. Each of the k compressed tile representations can have m rows and k columns.

At action 1122, self-attention 750, 754, 756 can be iteratively applied to the k compressed tile representations 740 to generate interpretations of those compressed tile features in the k compressed tile representations that correspond to those embedding features in the k embedding tiles that are revealed by the k Boolean tiles.

At action 1124, those interpreted features can be aggregated from the interpretations that correspond to those embedding features in the k embedding tiles that are concealed by the k Boolean tiles to generate an aggregated representation of the interpretations (gathered MSA representation 772). The aggregated representation can have m rows and k columns.

At action 1126, the aggregated representation 772 can be translated into identities 790 of the masked residues.

System

FIG. 12 is a system 1200 that is configured to implement the PrimateAI language model, in accordance with one implementation of the technology disclosed.

A memory 1202 can store a multiple sequence alignment (MSA) with a plurality of masked residues.

A chunking logic 1204 can be configured to chunk the MSA into a series of chunks.

A first attention logic 1206 can be configured to attend to a representation of the series of chunks and produce a first attention output.

A first aggregation logic 1208 can be configured to produce a first aggregated output that contains those features in the first attention output that correspond to masked residues in the plurality of masked residues. The features include elements of an MSA, in one implementation, such as one-hot encodings of amino acids in the MSA.

A mask revelation logic 1210 can be configured to produce an informed output based on the first aggregated output and a Boolean mask that, on a subset-by-subset basis, alternates between concealing a given subset of the masked residues and revealing remaining subsets of the masked residues.

A second attention logic 1212 can be configured to attend to the informed output and produce a second attention output based on masked residues revealed by the Boolean mask.

A second aggregation logic 1214 can be configured to produce a second aggregated output that contains those features in the second attention output that correspond to masked residues concealed by the Boolean mask.

An output logic 1216 can be configured to produce identifications of the masked residues based on the second aggregated output.

Objective Indicia of Inventiveness and Non-Obviousness

FIG. 13 shows the performance evaluation 1300 of the language modelling part of the PrimateAI language model (LM) compared to the replicated VAE part of EVE (J. Frazer et al., Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91-95 (2021) (Evolutionary model of Variant Effect) labelled “EVE*”) model and their combined score (labelled “PrimateAI LM+EVE*-only”). The performance is further compared to a selection of competitive unsupervised methods (ESM1v, SIFT, LIST-S2). In clockwise direction starting from the top left, the individual panels correspond to evaluation on DDD vs UKBB, Assays, ClinVar, ASD, CHD, DDD and UKBB. For Assays and UKBB, the summary statistics are given in terms of absolute value (|corr|) of correlation between score and an experimental measure of pathogenicity, i.e., mean phenotype (UKBB) or assays score (Assays). For DDD, we calculate the P-value of the Wilcoxon rank-sum test for control and case distribution over all datasets. For ClinVar, we measure the AUC averaged over all genes.

Evaluation Datasets

Saturation Mutagenesis Assays

Performance of the PrimateAI language model is compared using deep mutational scanning assays for the following 9 genes: Amyloid-beta, YAP1, MSH2, SYUA, VKOR1, PTEN, BRCA1, TP53, and ADRB2. A few assays of the genes for which the prediction scores of some classifiers are unavailable are excluded from the evaluation analysis, including TPMT, RASH, CALM1, UBE2I, SUMO1, TPK1, and MAPK1. Also excluded are assays of KRAS (due to different transcript sequence), SLCO1B1 (only 137 variants), and Amyloid-beta. Performance of the PrimateAI language model is evaluated by computing the absolute Spearman rank correlation between model prediction scores and assay scores individually for each assay and then taking the mean across all assays.
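The summary statistic described above, i.e., the mean over assays of the absolute Spearman rank correlation, can be sketched in plain Python as follows. The function names and the dictionary layout are assumptions, tied values receive average ranks, and the numbers in the usage lines are toy values rather than results from any assay.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def mean_abs_spearman(assays):
    """assays maps an assay name to (model_scores, assay_scores)."""
    return sum(abs(spearman(p, a)) for p, a in assays.values()) / len(assays)


# Toy values only: one perfectly concordant and one perfectly inverted assay.
toy = {"assay_1": ([1, 2, 3], [10, 20, 30]), "assay_2": ([1, 2, 3], [3, 2, 1])}
summary = mean_abs_spearman(toy)
```

Taking the absolute value before averaging means that an assay whose readout is inversely oriented relative to the prediction score still counts as agreement.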

UK Biobank

The UK Biobank (UKBB) dataset contains 61 phenotypes across 100 genes. Evaluating on variants common to all methods reduces the number to 41 phenotypes across 42 genes. The absolute Spearman rank correlation is calculated between the predicted pathogenicity scores and the quantitative phenotype scores for each pair of gene/phenotype. Only gene/phenotype pairs with at least 10 variants were included in the evaluation (14 phenotypes across 16 genes). It was confirmed that the evaluation is robust to this choice of threshold.

ClinVar

Performance of the PrimateAI language model in classifying clinical labels of ClinVar missense variants as benign or pathogenic is benchmarked. Both “benign” and “likely benign” labelled variants are considered benign, and both “pathogenic” and “likely pathogenic” labelled variants are considered pathogenic. To ensure high-quality labels, only ClinVar variants with 1-star review status or above (including “criteria provided, single submitter”, “criteria provided, multiple submitters, no conflicts”, “reviewed by expert panel”, “practice guideline”) are included. This reduced the number of variants from 36,705 to 22,165 for the pathogenic class and from 41,986 to 39,560 for the benign class. The area under the receiver operating characteristic curve for each gene is calculated and then the mean AUC across all genes is reported.
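A minimal sketch of the per-gene AUC averaging follows. It uses the pairwise-comparison form of the AUC, which equals the Mann-Whitney U statistic divided by the number of pathogenic-benign pairs. The function names, the dictionary layout, and the scores in the usage lines are assumptions and toy values, and the sketch assumes that higher scores indicate pathogenicity.

```python
def auc(pathogenic_scores, benign_scores):
    """Fraction of pathogenic-benign pairs ranked correctly (ties count half)."""
    wins = sum((p > b) + 0.5 * (p == b)
               for p in pathogenic_scores for b in benign_scores)
    return wins / (len(pathogenic_scores) * len(benign_scores))


def mean_gene_auc(genes):
    """genes maps a gene name to (pathogenic_scores, benign_scores)."""
    return sum(auc(p, b) for p, b in genes.values()) / len(genes)


# Toy values only, not ClinVar data.
toy = {"gene_a": ([0.9, 0.8], [0.1, 0.85]), "gene_b": ([0.7], [0.2, 0.3])}
summary = mean_gene_auc(toy)
```

Averaging per gene, rather than pooling all variants, gives each gene equal weight regardless of how many labelled variants it contributes.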

DDD/ASD/CHD De Novo Missense Variants

To evaluate the performance of the deep learning network in clinical settings, de novo mutations from published studies for intellectual disorders, including autism spectrum disorder (ASD) and developmental disorders (DDD), are obtained. ASD contained 2,127 patients with at least one de novo missense (DNM) mutation. Taken together, there are a total of 3,135 DNM mutations. This reduced to 517 patients with at least one DNM variant and a total of 558 DNM variants after requiring that all methods had predictions for those variants. In DDD, 17,952 patients had at least one de novo missense variant (26,880 variants in total), reducing to 5,872 patients (6,398 variants) after requiring availability of predictions of all methods. A set of DNM variants from patients with congenital heart disorders (CHD) is obtained, consisting of 1,839 de novo missense variants from 1,342 patients (reducing to 314 variants from 299 patients after requiring availability of predictions of all methods). For all three datasets of de novo variants from affected patients, a shared set of DNM variants from healthy controls is used, which contains 1,823 DNM variants from 1,215 healthy controls with at least one DNM variant and is collected from multiple studies. It was reduced to 250 variants (235 patients) after requiring availability of variant prediction scores of all methods. For each disease set of DNMs, the Mann-Whitney U test is applied to evaluate how well each classifier can distinguish the DNM set of patients from that of controls.

Methods for Comparison

Predictions from other methods were evaluated using rank scores downloaded from dbNSFP4.2a, a database for functional prediction. To avoid dramatic reductions in the number of common variants, methods with incomplete sets of scores (methods with scores for fewer than 67 million out of 71 million possible missense variants in hg38) are removed, except Polyphen2, which is retained due to its widespread adoption. We included the following methods (method abbreviation) for comparison: BayesDel_noAF (BayesDel), CADD_raw (CADD), DANN, DEOGEN2, LIST-S2, M-CAP, MutationTaster_converted (MutationTaster), PROVEAN_converted (PROVEAN), Polyphen2_HVAR (Polyphen2; due to better performance than Polyphen2_HDIV), PrimateAI, Revel (REVEL), SIFT_converted (SIFT), VEST4, and fathmm-MKL_coding (fathmm-MKL; highest performance among the fathmm models for the given benchmarks).

Applying EVE to More Proteins

In the original publication, EVE is applied only to a small set of disease-associated genes in ClinVar. To generate the disclosed language model-based training data set, it is essential to expand the predictions of EVE to as many proteins as possible. Because the EVE source code is unavailable, a similar method, DeepSequence, is applied, and the DeepSequence scores are converted into EVE scores by fitting Gaussian mixture models. An up-to-date version of UniRef100 is used, but otherwise the alignment depth and sequence coverage filtering steps described in EVE are followed. At least one prediction is achieved in 18,920 proteins, for a total of 50.2 M predicted variants out of 71.2 M possible missense variants. To validate the disclosed replication, the replicated EVE models are evaluated using published variants from EVE. Scores from the replicated EVE model result in performance comparable to the published EVE software on all benchmarking datasets, e.g., both methods achieve 0.41 mean absolute correlation on Assays and 0.22 mean absolute correlation for UKBB.
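The text does not spell out how the Gaussian mixture models are fitted, so the following is only a sketch under stated assumptions: a one-dimensional, two-component mixture fitted by expectation-maximization to the scores of one protein, with the posterior probability of the higher-mean component taken as the converted score. The function names, the initialization from the lower and upper halves of the data, and the "higher mean is pathogenic" convention are all assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_two_component_gmm(xs, n_iter=100):
    """Fit a 1-D two-component Gaussian mixture by EM; returns (weights, means, variances)."""
    ordered = sorted(xs)
    half = len(ordered) // 2
    groups = [ordered[:half], ordered[half:]]  # assumed initialization
    means = [sum(g) / len(g) for g in groups]
    variances = [max(sum((x - m) ** 2 for x in g) / len(g), 1e-6)
                 for g, m in zip(groups, means)]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each score.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = max(sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6)
    return weights, means, variances

def pathogenic_posterior(x, weights, means, variances):
    """Posterior probability that x belongs to the higher-mean (assumed pathogenic) component."""
    dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(dens)
    hi = 0 if means[0] > means[1] else 1
    return dens[hi] / total if total > 0 else 0.5

# Hypothetical raw scores for one protein, forming two well-separated clusters.
raw = [0.1, 0.2, 0.15, 0.05, 5.0, 5.2, 4.9, 5.1]
params = fit_two_component_gmm(raw)
converted = [pathogenic_posterior(x, *params) for x in raw]
```

The converted values lie between 0 and 1, which makes scores comparable across proteins even when the raw score ranges differ.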

Benchmarking PrimateAI Language Model Against Other Sequence-Only Models For Pathogenicity Predictions

The PrimateAI language model falls into a class of methods that are trained only to model protein sequences but perform surprisingly well as pathogenicity predictors. Although such methods do not achieve the best overall performance by themselves, they make crucial features or components in classifiers that incorporate more diverse data. FIG. 13 summarizes the evaluation performance of the PrimateAI language model against other such sequence-only methods for pathogenicity prediction: ESM1v, EVE, LIST-S2, and SIFT. Our language model outperforms another language model, ESM1v, on all the testing datasets except assays, while using only 1/50^(th) of the training time. This is particularly striking as the PrimateAI language model (PrimateAI LM) does not rely on any fine-tuning on assays.

Combining PrimateAI Language Model With EVE

Language models are trained to model the entire universe of proteins, whereas EVE trains a separate model for each human protein and all similar sequences. This, together with the differences in model architecture and training algorithms, suggests that the models extract distinct features from their input. Therefore, we expected the scores from EVE and our language model to be complementary, and that combining the scores may result in improved performance. We found that simply taking the mean of their pathogenicity scores already performs better than either of the two methods alone. More elaborate combinations, e.g., using ridge regression, did not lead to any further improvements. The resulting performance is shown in FIG. 13, where the combined score leads to a performance gain of 6.6% (or 6.8%) in mean correlation on assays compared to the PrimateAI LM (or compared to replicated EVE), a 1.4% (or 1.7%) improvement in mean AUC on ClinVar, and increases in P-value by 11% (29%) for DDD, 3% (26%) for ASD, and 17% (23%) for CHD.

Top-1 Training Accuracy

FIG. 14 depicts the top-1 training accuracy 1400 of the PrimateAI language model. An ensemble of six PrimateAI language model networks was trained with different random seeds for training data sampling and model parameter initialization. Their top-1 accuracies during training are shown in FIG. 14 for mask locations in the query sequence and in all sequences in UniRef50 MSAs. Top-1 accuracy for the query sequence is much lower than for all sequences because the query sequence does not contain gap tokens, which are easier to predict than residues because gap tokens often form long and contiguous segments in MSAs. The PrimateAI language model accuracy on query sequences continues to improve with training. In some implementations, convergence can be accelerated by adding auxiliary losses to each layer of the PrimateAI language model.
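To make the two accuracy curves concrete, the sketch below computes top-1 accuracy separately for masked positions in the query row and for masked positions in all rows. The data layout (row 0 as the query, a dictionary of per-token scores for each masked position) and the three-token vocabulary are hypothetical simplifications.

```python
def top1_accuracy(predictions, targets):
    """Fraction of masked positions whose highest-scoring token equals the true token."""
    hits = 0
    for scores, target in zip(predictions, targets):
        best = max(scores, key=scores.get)
        hits += best == target
    return hits / len(targets)

# Each entry: (MSA row index, predicted token scores, true token); row 0 is the query.
masked = [
    (0, {"A": 0.5, "G": 0.3, "-": 0.2}, "G"),
    (0, {"A": 0.6, "G": 0.3, "-": 0.1}, "A"),
    (1, {"A": 0.1, "G": 0.1, "-": 0.8}, "-"),  # gap token inside a long gap run
    (2, {"A": 0.2, "G": 0.1, "-": 0.7}, "-"),
]
query = [(scores, target) for row, scores, target in masked if row == 0]
all_rows = [(scores, target) for _, scores, target in masked]

query_acc = top1_accuracy(*zip(*query))
all_acc = top1_accuracy(*zip(*all_rows))
```

In this toy case the easily predicted gap tokens in the non-query rows raise the all-sequence accuracy above the query accuracy, mirroring the effect described above.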

Entropy and Pathogenicity Score

Scores of the PrimateAI language model can be tabulated for future reference, rather than re-running the model every time its scores are needed. For example, the PrimateAI language model’s fill-in-the-blank predictions can be provided for locations of interest at every site in 19,071 human proteins, totaling predictions for 2,057,437,040 variants at 108,286,160 positions. A person skilled in the art will appreciate that these numbers would change, for example, if the small number of human proteins that were not included here were included. In some implementations, the PrimateAI language model can be ensembled to produce averaged scores that have higher performance than individual model scores. For example, each prediction can be made by an ensemble of six models, with each model contributing at least four inferences with different random seeds for sampling and ordering of sequences in human MSAs. Inference logits can be averaged by taking means of predictions grouped by random seed, and then taking the mean of the means.
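The two-stage averaging of inference logits (mean within each random-seed group, then mean of the group means) might look like the following sketch. The input layout, a list of (seed, logits) pairs, and the function name are assumptions made for the example.

```python
from collections import defaultdict

def ensemble_logits(inferences):
    """Average logits by seed group first, then average the per-seed means.

    inferences: list of (seed, logits), where logits is a list of floats of equal length.
    """
    by_seed = defaultdict(list)
    for seed, logits in inferences:
        by_seed[seed].append(logits)

    def mean_vectors(vectors):
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    seed_means = [mean_vectors(group) for group in by_seed.values()]
    return mean_vectors(seed_means)

# Hypothetical example: seed 0 contributes two inferences, seed 1 contributes one.
inferences = [(0, [1.0, 3.0]), (0, [3.0, 5.0]), (1, [10.0, 0.0])]
averaged = ensemble_logits(inferences)
```

When groups have unequal sizes, the mean of means weights each seed equally, which differs from a flat mean over all inferences.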

Pathogenicity prediction of a variant can be evaluated using the relative values of the logits for the reference and alternative amino acids, or evaluated by subtracting the logit value for the reference amino acid from the logit value for the alternative amino acid. The probabilities are normalized over all possible residues disregarding the gap token, such that ∑_(r) p_(r) = 1, with the probability p_(r) of the r^(th) residue obtained from the ensembled logits. The log difference captures how unlikely the variant amino acid is compared to the reference amino acid. However, this score does not consider the predictions for the other 18 possible amino acids, which contain information about the language model’s internal estimate of protein site conservation as well as about the convergence of the language model. To capture a variant-agnostic, site-dependent contribution to the pathogenicity score, the entropy evaluated over amino acid predictions, S = -∑_(r) p_(r) log(p_(r)), with probability p_(r) of the r^(th) residue, is used. Specifically, a score, s_(alt), for the alternative residue at a given site is given by the usual log difference of the alternative and reference logits at that site minus the entropy over amino acids at the given site, i.e., s_(alt) = log(p_(alt)) - log(p_(ref)) - S.
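The score s_(alt) = log(p_(alt)) - log(p_(ref)) - S can be illustrated with the short sketch below. It assumes natural logarithms (the text does not state the base), takes ensembled logits for the twenty amino acids with the gap token already dropped, and uses hypothetical function and variable names.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def entropy_adjusted_scores(logits, ref):
    """Return {alt: s_alt} with s_alt = log(p_alt) - log(p_ref) - S at one site.

    logits: dict mapping each of the 20 amino acids to its ensembled logit.
    Probabilities are normalized over amino acids only, so they sum to 1.
    """
    peak = max(logits[a] for a in AMINO_ACIDS)
    log_z = peak + math.log(sum(math.exp(logits[a] - peak) for a in AMINO_ACIDS))
    log_p = {a: logits[a] - log_z for a in AMINO_ACIDS}  # numerically stable log-softmax
    entropy = -sum(math.exp(lp) * lp for lp in log_p.values())
    return {a: log_p[a] - log_p[ref] - entropy for a in AMINO_ACIDS if a != ref}

# Maximally uncertain site: all residues equally likely, so S = log(20).
uniform = {a: 0.0 for a in AMINO_ACIDS}
uniform_scores = entropy_adjusted_scores(uniform, ref="A")

# Strongly conserved site: the reference residue dominates, so S is close to 0.
conserved = {a: 0.0 for a in AMINO_ACIDS}
conserved["A"] = 10.0
conserved_scores = entropy_adjusted_scores(conserved, ref="A")
```

Because the entropy depends only on the site, it shifts the scores of all 19 alternative residues at that site by the same amount.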

The entropy term is small whenever the probability over all amino acids is dominated by a single term, and large whenever the model is uncertain about the residues and assigns high values to multiple residues. Physically, in the latter case the site is associated with little conservation and is likely to mutate, which should lead to less pathogenic signal. Adjusting the scores by entropy incorporates a model-internal estimate of amino acid conservation. A given log difference between the alternative residue and the reference will be considered more pathogenic whenever it is associated with a highly conserved site. The score adjustment additionally accounts for the lack of convergence associated with a heavily undertrained model.

“Logic” (e.g., masking logic), as used herein, can be implemented in the form of a computer product including a non-transitory computer readable storage medium with computer usable program code for performing the method steps described herein. The “logic” can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. The “logic” can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media). In one implementation, the logic implements a data processing function. The logic can be a general purpose, single core or multicore, processor with a computer program specifying the function, a digital signal processor with a computer program, configurable logic such as an FPGA with a configuration file, a special purpose circuit such as a state machine, or any combination of these. Also, a computer program product can embody the computer program and configuration file portions of the logic.

Computer System

FIG. 15 shows a computer system 1500 that can be used for compilation and runtime execution of the PrimateAI language model. Computer system 1500 includes at least one central processing unit (CPU) 1572 that communicates with a number of peripheral devices via bus subsystem 1555. These peripheral devices can include a storage subsystem 1510 including, for example, memory devices and a file storage subsystem 1536, user interface input devices 1538, user interface output devices 1576, and a network interface subsystem 1574. The input and output devices allow user interaction with computer system 1500. Network interface subsystem 1574 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.

In one implementation, the pathogenicity predictor 150 (e.g., the PrimateAI language model) is communicably linked to the storage subsystem 1510 and the user interface input devices 1538.

User interface input devices 1538 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 1500.

User interface output devices 1576 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 1500 to the user or to another machine or computer system.

Storage subsystem 1510 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processors 1578.

Processors 1578 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Processors 1578 can be hosted by a deep learning cloud platform such as Google Cloud Platform®, Xilinx®, and Cirrascale®. Examples of processors 1578 include Google’s Tensor Processing Unit (TPU)®, rackmount solutions like GX4 Rackmount Series®, GX15 Rackmount Series®, NVIDIA DGX-1®, Microsoft’s Stratix V FPGA®, Graphcore’s Intelligent Processor Unit (IPU)®, Qualcomm’s Zeroth Platform® with Snapdragon Processors®, NVIDIA’s Volta®, NVIDIA’s DRIVE PX®, NVIDIA’s JETSON TX1/TX2 MODULE®, Intel’s Nirvana®, Movidius VPU®, Fujitsu DPI®, ARM’s DynamicIQ®, IBM TrueNorth®, Lambda GPU Server with Tesla V100s®, and others.

Memory subsystem 1522 used in the storage subsystem 1510 can include a number of memories including a main random access memory (RAM) 1532 for storage of instructions and data during program execution and a read only memory (ROM) 1534 in which fixed instructions are stored. A file storage subsystem 1536 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by file storage subsystem 1536 in the storage subsystem 1510, or in other machines accessible by the processor.

Bus subsystem 1555 provides a mechanism for letting the various components and subsystems of computer system 1500 communicate with each other as intended. Although bus subsystem 1555 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.

Computer system 1500 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 1500 depicted in FIG. 15 is intended only as a specific example for purposes of illustrating the preferred implementations of the present invention. Many other configurations of computer system 1500 are possible having more or fewer components than the computer system depicted in FIG. 15.

Clauses

The technology disclosed can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections - these recitations are hereby incorporated forward by reference into each of the following implementations.

One or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of a computer product, including a non-transitory computer-readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. Yet further, in another aspect, one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer-readable storage medium (or multiple such media).

The clauses described in this section can be combined as features. In the interest of conciseness, the combinations of features are not individually enumerated and are not repeated with each base set of features. The reader will understand how features identified in the clauses described in this section can readily be combined with sets of base features identified as implementations in other sections of this application. These and other features, aspects, and advantages of the technology disclosed will become apparent from the following detailed description of illustrative implementations thereof, which is to be read in connection with the accompanying drawings. These clauses are not meant to be mutually exclusive, exhaustive, or restrictive; and the technology disclosed is not limited to these clauses but rather encompasses all possible combinations, modifications, and variations within the scope of the claimed technology and its equivalents.

Other implementations of the clauses described in this section can include a non-transitory computer-readable storage medium storing instructions executable by a processor to perform any of the clauses described in this section. Yet another implementation of the clauses described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the clauses described in this section.

We disclose the following clauses:

Clause Set 1

1. A computer-implemented method of variant pathogenicity prediction, including:

-   accessing a multiple sequence alignment that aligns a query residue sequence to a plurality of non-query residue sequences;
-   applying a set of periodically-spaced masks to a first set of residues at a first set of positions in the multiple sequence alignment, wherein the first set of residues includes a residue-of-interest at a position-of-interest in the query residue sequence;
-   cropping a portion of the multiple sequence alignment that includes
    -   (i) the set of periodically-spaced masks at the first set of positions, and
    -   (ii) a second set of residues at a second set of positions in the multiple sequence alignment to which the set of periodically-spaced masks is not applied; and
-   generating a pathogenicity prediction for a variant at the position-of-interest based on the portion of the multiple sequence alignment.

2. The computer-implemented method of clause 1, wherein the multiple sequence alignment aligns the query residue sequence to the plurality of non-query residue sequences along a per-position dimension and along a per-sequence dimension.

3. The computer-implemented method of clause 2, wherein the set of periodically-spaced masks is applied along the per-sequence dimension within a window of sequences in the multiple sequence alignment.

4. The computer-implemented method of clause 3, wherein the set of periodically-spaced masks is applied along the per-position dimension within a window of positions across the window of sequences in the multiple sequence alignment.

5. The computer-implemented method of clause 4, wherein the portion spans the window of positions across the multiple sequence alignment.

6. The computer-implemented method of clause 4, wherein the portion spans the window of positions across a subset of sequences in the multiple sequence alignment.

7. The computer-implemented method of clause 1, wherein the portion has a predetermined width and a predetermined height.

8. The computer-implemented method of clause 7, wherein the portion is padded to compensate for multiple sequence alignments that have widths smaller than the predetermined width of the portion.

9. The computer-implemented method of clause 7, wherein the portion is padded to compensate for multiple sequence alignments that have heights smaller than the predetermined height of the portion.

10. The computer-implemented method of clause 2, wherein the set of periodically-spaced masks is distributed along the per-sequence dimension into subsets of periodically-spaced masks.

11. The computer-implemented method of clause 10, wherein the subsets of periodically-spaced masks correspond to sequences in the window of sequences.

12. The computer-implemented method of clause 11, wherein successive masks in a subset of periodically-spaced masks corresponding to a given sequence in the window of sequences are spaced apart by unmasked residues in the given sequence.

13. The computer-implemented method of clause 12, wherein a number of the unmasked residues by which the successive masks are spaced apart is the same across the sequences in the window of sequences.

14. The computer-implemented method of clause 12, wherein a number of the unmasked residues by which the successive masks are spaced apart varies across the sequences in the window of sequences.

15. The computer-implemented method of clause 12, wherein a starting position in a given sequence at which a corresponding subset of periodically-spaced masks begins varies between the sequences in the window of sequences.

16. The computer-implemented method of clause 12, wherein the starting position follows a diagonal pattern across the sequences in the window of sequences.

17. The computer-implemented method of clause 14, wherein the starting position follows a diagonal pattern that begins to repeat at least once across the sequences in the window of sequences.

18. The computer-implemented method of clause 17, wherein the starting position follows a diagonal pattern that repeats at least once across the sequences in the window of sequences.

19. The computer-implemented method of clause 1, wherein the set of periodically-spaced masks has a pattern.

20. The computer-implemented method of clause 19, wherein the pattern is a diagonal pattern.

21. The computer-implemented method of clause 19, wherein the pattern is a hexagonal pattern.

22. The computer-implemented method of clause 19, wherein the pattern is a diamond pattern.

23. The computer-implemented method of clause 19, wherein the pattern is a rectangle pattern.

24. The computer-implemented method of clause 19, wherein the pattern is a square pattern.

25. The computer-implemented method of clause 19, wherein the pattern is a triangle pattern.

26. The computer-implemented method of clause 19, wherein the pattern is a convex pattern.

27. The computer-implemented method of clause 19, wherein the pattern is a concave pattern.

28. The computer-implemented method of clause 19, wherein the pattern is a polygonal pattern.

29. The computer-implemented method of clause 19, further including right-shifting a cropping window used for the cropping to minimize padding of the portion.

30. The computer-implemented method of clause 29, further including left-shifting the cropping window to minimize the padding of the portion.

31. The computer-implemented method of clause 1, further including configuring the cropping window to position the position-of-interest in a center column of the portion.

32. The computer-implemented method of clause 31, further including configuring the cropping window to position the position-of-interest adjacent to the center column.

33. The computer-implemented method of clause 1, further including substituting, in the portion, the set of periodically-spaced masks at the first set of positions with learned mask embeddings, and substituting, in the portion, the second set of residues at the second set of positions with learned residue embeddings.

34. The computer-implemented method of clause 33, wherein a one-hot encoding generator generates the learned mask embeddings and the learned residue embeddings.

35. The computer-implemented method of clause 34, wherein the learned mask embeddings and the learned residue embeddings are selected from a look-up table.

36. The computer-implemented method of clause 1, further including substituting, in the portion, the set of periodically-spaced masks at the first set of positions and the second set of residues at the second set of positions with learned position embeddings.

37. The computer-implemented method of clause 36, further including chunking the portion with the learned mask embeddings, the learned residue embeddings, and the learned position embeddings into a plurality of chunks.

38. The computer-implemented method of clause 37, further including processing the plurality of chunks as an aggregate and generating an alternative representation of the portion.

39. The computer-implemented method of clause 38, wherein a linear projection layer uses a filter bank of 1x1 convolutions to process the plurality of chunks as the aggregate and generate the alternative representation of the portion.

40. The computer-implemented method of clause 39, further including processing the alternative representation of the portion through a cascade of attention blocks to generate an updated alternative representation of the portion.

41. The computer-implemented method of clause 40, wherein attention blocks in the cascade of attention blocks use self-attention.

42. The computer-implemented method of clause 41, wherein each of the attention blocks includes a tied row-wise gated self-attention, followed by a column-wise gated self-attention, and followed by a transition logic.

43. The computer-implemented method of clause 40, wherein the attention blocks use cross-attention.

44. The computer-implemented method of clause 40, wherein a mask revelation block processes the updated alternative representation of the portion and generates an informed alternative representation of the portion.

45. The computer-implemented method of clause 44, wherein the mask revelation block gathers features aligned with masked locations in a row and, for each mask in the row, reveals embedded target tokens at other masked locations in the row.

46. The computer-implemented method of clause 44, wherein a mask gather block processes the informed alternative representation of the portion and generates a gathered alternative representation of the portion.

47. The computer-implemented method of clause 46, wherein the mask gather block processes the informed alternative representation through a cascade of transition logic and row-wise gated self-attention blocks that gather features where target embeddings remained masked.

48. The computer-implemented method of clause 47, wherein an output block processes the gathered alternative representation of the portion and predicts identities of residues masked by the set of periodically-spaced masks.

49. The computer-implemented method of clause 48, wherein the output block includes a transition logic and a perceptron logic.

50. The computer-implemented method of clause 48, wherein a probability of applying a subset of periodically-spaced masks to a non-sequence in the window of sequences is proportional to (1 - a number of gap tokens in the non-sequence)^2.

51. The computer-implemented method of clause 1, further including generating the pathogenicity prediction for the variant based on a difference between a log probability of the variant and a log probability of a corresponding reference amino acid less an entropy evaluated over amino acid-wise predictions.

Clause Set 2

1. A computer-implemented method, including:

-   accessing a multiple sequence alignment (MSA), wherein the MSA has p rows and r columns, wherein the p rows correspond to p protein sequences, and wherein the r columns correspond to r residue positions;
-   accessing a mask grid, wherein the mask grid has m mask distributions, and wherein each of the m mask distributions has k periodically-spaced masks at k ordinal positions that begin at varying offsets from a first residue position in the mask grid;
-   applying the m mask distributions to m protein sequences in the p protein sequences to generate a partially-masked MSA that contains masked residues and unmasked residues, where p > m;
-   translating the masked residues and the unmasked residues into learned embeddings, concatenating the learned embeddings with residue position embeddings to generate an embedded representation of the partially-masked MSA;
-   chunking the embedded representation into a series of chunks, concatenating chunks in the series of chunks into a stack, and translating the stack into a compressed representation of the embedded representation, wherein the compressed representation has m rows and r columns;
-   iteratively applying axial-attention across the m rows and the r columns of the compressed representation, and interleaving the applied attention to generate an updated representation of the compressed representation, wherein the updated representation has m rows and r columns;
-   aggregating, from the updated representation, k updated representation tiles, wherein each of the k updated representation tiles contains those updated representation features of the updated representation that correspond to the masked residues, wherein each of the k updated representation tiles has m rows and k columns, wherein a given column in the k columns of a given updated representation tile contains a respective subset of the updated representation features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column;
-   aggregating, from the embedded representation, k embedding tiles corresponding to the k updated representation tiles, wherein each of the k embedding tiles contains those embedding features in a first chunk of the series of chunks that are translations of the masked residues, wherein each of the k embedding tiles has m rows and k columns, wherein a given column in the k columns of a given embedding tile contains a respective subset of the embedding features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column;
-   applying k Boolean tiles to the k embedding tiles to generate k Booleaned embedding tiles, wherein each of the k Boolean tiles has m rows and k columns, wherein each of the k Boolean tiles causes concealment of a corresponding one of the k columns in a corresponding one of the k embedding tiles, and causes revelation of other ones of the k columns in the corresponding one of the k embedding tiles, and wherein each of the k Booleaned embedding tiles has m rows and k columns;
-   concatenating the k Booleaned embedding tiles with the k updated representation tiles to generate k concatenated tiles, and translating the k concatenated tiles into k compressed tile representations of the k concatenated tiles, wherein each of the k compressed tile representations has m rows and k columns;
-   iteratively applying self-attention to the k compressed tile representations to generate interpretations of those compressed tile features in the k compressed tile representations that correspond to those embedding features in the k embedding tiles that are revealed by the k Boolean tiles;
-   aggregating those interpreted features from the interpretations that correspond to those embedding features in the k embedding tiles that are concealed by the k Boolean tiles to generate an aggregated representation of the interpretations, wherein the aggregated representation has m rows and k columns; and
-   translating the aggregated representation into identities of the masked residues.

2. The computer-implemented of clause 1, further including using aone-hot encoding scheme to translate twenty naturally-occurringresidues, a gap residue, and a mask into respective one-hot encodedvectors.

3. The computer-implemented of clause 2, further including training aneural network to generate respective learned embeddings for therespective one-hot encoded vectors.

4. The computer-implemented of clause 3, wherein the masked residues andthe unmasked residues are translated into the learned embeddings basedon a lookup table that maps the respective one-hot encoded vectors tothe respective learned embeddings.

5. The computer-implemented of clause 4, wherein the residue positionembeddings specify an order in which residues are arranged in the pprotein sequences.

6. The computer-implemented method of clause 1, wherein the chunks are concatenated into the stack along a channel dimension.

7. The computer-implemented method of clause 1, wherein the stack is translated into the compressed representation by processing the stack through a linear projection.

8. The computer-implemented method of clause 7, wherein the linear projection uses a plurality of one-dimensional (1D) convolution filters.

9. The computer-implemented method of clause 8, wherein the k concatenated tiles are translated into the k compressed tile representations by processing the k concatenated tiles through the linear projection.

10. The computer-implemented method of clause 1, wherein the aggregated representation is translated into the identities of the masked residues by processing the aggregated representation through a revelation output head.

11. The computer-implemented method of clause 1, wherein p = m.

12. The computer-implemented method of clause 1, wherein each of the k Boolean tiles causes concealment of the corresponding one of the k columns in the corresponding one of the k embedding tiles, and causes revelation of at least some of the other ones of the k columns in the corresponding one of the k embedding tiles.

13. The computer-implemented method of clause 1, wherein each of the k Boolean tiles causes concealment of a corresponding subset of the k columns in the corresponding one of the k embedding tiles, and causes revelation of at least some of the other ones of the k columns in the corresponding one of the k embedding tiles.

14. The computer-implemented method of clause 1, wherein the k periodically-spaced masks of at least some of the m mask distributions begin at a same offset from the first residue position.
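
The mask grid recited in the clauses above (m mask distributions, each with k periodically-spaced masks that begin at a same offset or at varying offsets) can be illustrated with a short NumPy sketch. The function name `build_mask_grid`, the spacing `r // k`, and the default rule for varying offsets are assumptions made for illustration only; the clauses do not specify them.

```python
import numpy as np


def build_mask_grid(m, r, k, offsets=None):
    """Return an (m, r) Boolean grid with k periodically-spaced masks per row.

    Hypothetical helper: the stride r // k and the default offsets are
    illustrative assumptions, not values taken from the clauses.
    """
    stride = r // k  # assumed spacing between consecutive masks in a row
    if stride < 1:
        raise ValueError("r must be at least k")
    if offsets is None:
        # Assumed rule for varying offsets: cycle through 0 .. stride - 1.
        offsets = [row % stride for row in range(m)]
    grid = np.zeros((m, r), dtype=bool)
    for row, offset in enumerate(offsets):
        if not 0 <= offset < stride:
            raise ValueError("offset must lie in [0, stride)")
        positions = offset + stride * np.arange(k)  # k ordinal positions
        grid[row, positions] = True
    return grid


# Same offset for every mask distribution.
same_offset_grid = build_mask_grid(m=3, r=12, k=4, offsets=[0, 0, 0])
# Varying offsets across mask distributions.
varying_offset_grid = build_mask_grid(m=3, r=12, k=4)
```

In this sketch every row carries exactly k masks, so the masked positions of each row can later be collected into an m-by-k tile.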

15. A system, comprising:

-   memory storing a multiple sequence alignment (MSA) with a plurality of masked residues;
-   chunking logic configured to chunk the MSA into a series of chunks;
-   first attention logic configured to attend to a representation of the series of chunks and produce a first attention output;
-   first aggregation logic configured to produce a first aggregated output that contains those features in the first attention output that correspond to masked residues in the plurality of masked residues;
-   mask revelation logic configured to produce an informed output based on the first aggregated output and a Boolean mask that, on a subset-by-subset basis, alternates between concealing a given subset of the masked residues and revealing remaining subsets of the masked residues;
-   second attention logic configured to attend to the informed output and produce a second attention output based on masked residues revealed by the Boolean mask;
-   second aggregation logic configured to produce a second aggregated output that contains those features in the second attention output that correspond to masked residues concealed by the Boolean mask; and
-   output logic configured to produce identifications of the masked residues based on the second aggregated output.

16. The system of clause 15, wherein the first attention logic uses axial-attention.

17. The system of clause 15, wherein the second attention logic uses self-attention.
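
The Boolean mask of clause 15, which conceals one subset of the masked residues while revealing the remaining subsets, corresponds to the k Boolean tiles of clause 1. A minimal sketch is given below. It assumes that concealment is realized by zeroing the concealed column's embedding features; the helper names `make_boolean_tiles` and `apply_boolean_tiles` and the embedding width `d` are likewise hypothetical.

```python
import numpy as np


def make_boolean_tiles(m, k):
    """Return k Boolean tiles stacked as a (k, m, k) array.

    Tile i is False (conceal) in column i and True (reveal) elsewhere.
    """
    tiles = np.ones((k, m, k), dtype=bool)
    for i in range(k):
        tiles[i, :, i] = False
    return tiles


def apply_boolean_tiles(embedding_tiles, boolean_tiles):
    """Zero out concealed columns of (k, m, k, d) embedding tiles.

    Zeroing is an assumed concealment mechanism, used here for illustration.
    """
    return embedding_tiles * boolean_tiles[..., np.newaxis]


m, k, d = 2, 3, 4  # illustrative sizes only
boolean_tiles = make_boolean_tiles(m, k)
embedding_tiles = np.ones((k, m, k, d))
booleaned_tiles = apply_boolean_tiles(embedding_tiles, boolean_tiles)
```

Each Booleaned embedding tile keeps its m rows and k columns; only the features in the concealed column are suppressed.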

18. A computer-implemented method, including:

-   accessing a multiple sequence alignment (MSA), wherein the MSA has p rows and r columns, wherein the p rows correspond to p protein sequences, and wherein the r columns correspond to r residue positions;
-   accessing a mask grid, wherein the mask grid has m mask distributions, and wherein each of the m mask distributions has k periodically-spaced masks at k ordinal positions;
-   applying the m mask distributions to m protein sequences in the p protein sequences to generate a partially-masked MSA that contains masked residues and unmasked residues, where p > m;
-   translating the masked residues and the unmasked residues into learned embeddings, concatenating the learned embeddings with residue position embeddings to generate an embedded representation of the partially-masked MSA;
-   chunking the embedded representation into a series of chunks, concatenating chunks in the series of chunks into a stack, and translating the stack into a compressed representation of the embedded representation;
-   iteratively applying axial-attention across the m rows and the r columns of the compressed representation, and interleaving the applied attention to generate an updated representation of the compressed representation;
-   aggregating, from the updated representation, k updated representation tiles, wherein each of the k updated representation tiles contains those updated representation features of the updated representation that correspond to the masked residues;
-   aggregating, from the embedded representation, k embedding tiles corresponding to the k updated representation tiles, wherein each of the k embedding tiles contains those embedding features in a first chunk of the series of chunks that are translations of the masked residues;
-   applying k Boolean tiles to the k embedding tiles to generate k Booleaned embedding tiles, wherein each of the k Boolean tiles causes concealment of a corresponding one of the k columns in a corresponding one of the k embedding tiles, and causes revelation of other ones of the k columns in the corresponding one of the k embedding tiles;
-   concatenating the k Booleaned embedding tiles with the k updated representation tiles to generate k concatenated tiles, and translating the k concatenated tiles into k compressed tile representations of the k concatenated tiles;
-   iteratively applying self-attention to the k compressed tile representations to generate interpretations of those compressed tile features in the k compressed tile representations that correspond to those embedding features in the k embedding tiles that are revealed by the k Boolean tiles;
-   aggregating those interpreted features from the interpretations that correspond to those embedding features in the k embedding tiles that are concealed by the k Boolean tiles to generate an aggregated representation of the interpretations; and
-   translating the aggregated representation into identities of the masked residues.

19. The computer-implemented method of clause 18, wherein the k periodically-spaced masks of at least some of the m mask distributions begin at varying offsets from a first residue position in the mask grid.

20. The computer-implemented method of clause 19, wherein the k periodically-spaced masks of at least some of the m mask distributions begin at a same offset from the first residue position.

21. The computer-implemented method of clause 18, wherein the compressed representation has m rows and r columns.

22. The computer-implemented method of clause 18, wherein the updated representation has m rows and r columns.

23. The computer-implemented method of clause 18, wherein each of the k updated representation tiles has m rows and k columns, wherein a given column in the k columns of a given updated representation tile contains a respective subset of the updated representation features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column.

24. The computer-implemented method of clause 18, wherein each of the k embedding tiles has m rows and k columns, wherein a given column in the k columns of a given embedding tile contains a respective subset of the embedding features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column.

25. The computer-implemented method of clause 18, wherein each of the k Boolean tiles has m rows and k columns.

26. The computer-implemented method of clause 18, wherein each of the k Booleaned embedding tiles has m rows and k columns.

27. The computer-implemented method of clause 18, wherein each of the k compressed tile representations has m rows and k columns.

28. The computer-implemented method of clause 18, wherein the aggregated representation has m rows and k columns.
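
The tile shapes recited in clauses 23 through 28 follow from gathering, for each of the m masked rows, the features at its k masked positions. The sketch below shows one way a single m-by-k tile could be gathered from an m-by-r feature map. The function name `gather_tile`, the feature width `d`, and the toy inputs are assumptions for illustration and are not part of the disclosure.

```python
import numpy as np


def gather_tile(features, mask_grid, k):
    """Gather features at masked positions into an (m, k, d) tile.

    features:  (m, r, d) array, e.g. an updated or embedded representation.
    mask_grid: (m, r) Boolean array with exactly k True entries per row.
    Column j of the tile holds the feature at the j-th ordinal mask position.
    """
    m = features.shape[0]
    if not np.all(mask_grid.sum(axis=1) == k):
        raise ValueError("each row must contain exactly k masks")
    # np.nonzero scans in row-major order, so columns are ascending per row.
    cols = np.nonzero(mask_grid)[1].reshape(m, k)
    rows = np.arange(m)[:, np.newaxis]
    return features[rows, cols]


# Toy input: m = 2 rows, r = 6 residue positions, d = 1 feature per residue.
features = np.arange(12).reshape(2, 6, 1)
mask_grid = np.zeros((2, 6), dtype=bool)
mask_grid[0, [0, 3]] = True  # row 0 masked at positions 0 and 3
mask_grid[1, [1, 4]] = True  # row 1 masked at positions 1 and 4
tile = gather_tile(features, mask_grid, k=2)
```

Because the tile is indexed by ordinal mask position rather than by residue position, rows whose masks begin at different offsets still align column by column.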

What is claimed is:
1. A computer-implemented method, including:
accessing a multiple sequence alignment (MSA), wherein the MSA has p rows and r columns, wherein the p rows correspond to p protein sequences, and wherein the r columns correspond to r residue positions;
accessing a mask grid, wherein the mask grid has m mask distributions, and wherein each of the m mask distributions has k periodically-spaced masks at k ordinal positions that begin at varying offsets from a first residue position in the mask grid;
applying the m mask distributions to m protein sequences in the p protein sequences to generate a partially-masked MSA that contains masked residues and unmasked residues, where p > m;
translating the masked residues and the unmasked residues into learned embeddings, concatenating the learned embeddings with residue position embeddings to generate an embedded representation of the partially-masked MSA;
chunking the embedded representation into a series of chunks, concatenating chunks in the series of chunks into a stack, and translating the stack into a compressed representation of the embedded representation, wherein the compressed representation has m rows and r columns;
iteratively applying axial-attention across the m rows and the r columns of the compressed representation, and interleaving the applied attention to generate an updated representation of the compressed representation, wherein the updated representation has m rows and r columns;
aggregating, from the updated representation, k updated representation tiles, wherein each of the k updated representation tiles contains those updated representation features of the updated representation that correspond to the masked residues, wherein each of the k updated representation tiles has m rows and k columns, wherein a given column in the k columns of a given updated representation tile contains a respective subset of the updated representation features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column;
aggregating, from the embedded representation, k embedding tiles corresponding to the k updated representation tiles, wherein each of the k embedding tiles contains those embedding features in a first chunk of the series of chunks that are translations of the masked residues, wherein each of the k embedding tiles has m rows and k columns, wherein a given column in the k columns of a given embedding tile contains a respective subset of the embedding features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column;
applying k Boolean tiles to the k embedding tiles to generate k Booleaned embedding tiles, wherein each of the k Boolean tiles has m rows and k columns, wherein each of the k Boolean tiles causes concealment of a corresponding one of the k columns in a corresponding one of the k embedding tiles, and causes revelation of other ones of the k columns in the corresponding one of the k embedding tiles, and wherein each of the k Booleaned embedding tiles has m rows and k columns;
concatenating the k Booleaned embedding tiles with the k updated representation tiles to generate k concatenated tiles, and translating the k concatenated tiles into k compressed tile representations of the k concatenated tiles, wherein each of the k compressed tile representations has m rows and k columns;
iteratively applying self-attention to the k compressed tile representations to generate interpretations of those compressed tile features in the k compressed tile representations that correspond to those embedding features in the k embedding tiles that are revealed by the k Boolean tiles;
aggregating those interpreted features from the interpretations that correspond to those embedding features in the k embedding tiles that are concealed by the k Boolean tiles to generate an aggregated representation of the interpretations, wherein the aggregated representation has m rows and k columns; and
translating the aggregated representation into identities of the masked residues.
2. The computer-implemented method of claim 1, further including using a one-hot encoding scheme to translate twenty naturally-occurring residues, a gap residue, and a mask into respective one-hot encoded vectors.
3. The computer-implemented method of claim 2, further including training a neural network to generate respective learned embeddings for the respective one-hot encoded vectors.
4. The computer-implemented method of claim 3, wherein the masked residues and the unmasked residues are translated into the learned embeddings based on a lookup table that maps the respective one-hot encoded vectors to the respective learned embeddings.
5. The computer-implemented method of claim 1, wherein the chunks are concatenated into the stack along a channel dimension.
6. The computer-implemented method of claim 1, wherein the stack is translated into the compressed representation by processing the stack through a linear projection.
7. The computer-implemented method of claim 6, wherein the linear projection uses a plurality of one-dimensional (1D) convolution filters.
8. The computer-implemented method of claim 1, wherein the aggregated representation is translated into the identities of the masked residues by processing the aggregated representation through a revelation output head.
9. The computer-implemented method of claim 1, wherein p = m.
10. The computer-implemented method of claim 1, wherein each of the k Boolean tiles causes concealment of the corresponding one of the k columns in the corresponding one of the k embedding tiles, and causes revelation of at least some of the other ones of the k columns in the corresponding one of the k embedding tiles.
11. The computer-implemented method of claim 1, wherein each of the k Boolean tiles causes concealment of a corresponding subset of the k columns in the corresponding one of the k embedding tiles, and causes revelation of at least some of the other ones of the k columns in the corresponding one of the k embedding tiles.
12. A system, comprising:
memory storing a multiple sequence alignment (MSA) with a plurality of masked residues;
chunking logic configured to chunk the MSA into a series of chunks;
first attention logic configured to attend to a representation of the series of chunks and produce a first attention output;
first aggregation logic configured to produce a first aggregated output that contains those features in the first attention output that correspond to masked residues in the plurality of masked residues;
mask revelation logic configured to produce an informed output based on the first aggregated output and a Boolean mask that, on a subset-by-subset basis, alternates between concealing a given subset of the masked residues and revealing remaining subsets of the masked residues;
second attention logic configured to attend to the informed output and produce a second attention output based on masked residues revealed by the Boolean mask;
second aggregation logic configured to produce a second aggregated output that contains those features in the second attention output that correspond to masked residues concealed by the Boolean mask; and
output logic configured to produce identifications of the masked residues based on the second aggregated output.
13. The system of claim 12, wherein the first attention logic uses axial-attention.
14. The system of claim 12, wherein the second attention logic uses self-attention.
15. A computer-implemented method, including:
accessing a multiple sequence alignment (MSA), wherein the MSA has p rows and r columns, wherein the p rows correspond to p protein sequences, and wherein the r columns correspond to r residue positions;
accessing a mask grid, wherein the mask grid has m mask distributions, and wherein each of the m mask distributions has k periodically-spaced masks at k ordinal positions;
applying the m mask distributions to m protein sequences in the p protein sequences to generate a partially-masked MSA that contains masked residues and unmasked residues, where p > m;
translating the masked residues and the unmasked residues into learned embeddings, concatenating the learned embeddings with residue position embeddings to generate an embedded representation of the partially-masked MSA;
chunking the embedded representation into a series of chunks, concatenating chunks in the series of chunks into a stack, and translating the stack into a compressed representation of the embedded representation;
iteratively applying axial-attention across m rows and r columns of the compressed representation, and interleaving the applied attention to generate an updated representation of the compressed representation;
aggregating, from the updated representation, k updated representation tiles, wherein each of the k updated representation tiles contains those updated representation features of the updated representation that correspond to the masked residues;
aggregating, from the embedded representation, k embedding tiles corresponding to the k updated representation tiles, wherein each of the k embedding tiles contains those embedding features in a first chunk of the series of chunks that are translations of the masked residues;
applying k Boolean tiles to the k embedding tiles to generate k Booleaned embedding tiles, wherein each of the k Boolean tiles causes concealment of a corresponding one of k columns in a corresponding one of the k embedding tiles, and causes revelation of other ones of the k columns in the corresponding one of the k embedding tiles;
concatenating the k Booleaned embedding tiles with the k updated representation tiles to generate k concatenated tiles, and translating the k concatenated tiles into k compressed tile representations of the k concatenated tiles;
iteratively applying self-attention to the k compressed tile representations to generate interpretations of those compressed tile features in the k compressed tile representations that correspond to those embedding features in the k embedding tiles that are revealed by the k Boolean tiles;
aggregating those interpreted features from the interpretations that correspond to those embedding features in the k embedding tiles that are concealed by the k Boolean tiles to generate an aggregated representation of the interpretations; and
translating the aggregated representation into identities of the masked residues.
16. The computer-implemented method of claim 15, wherein the k periodically-spaced masks of at least some of the m mask distributions begin at varying offsets from a first residue position in the mask grid.
17. The computer-implemented method of claim 16, wherein the k periodically-spaced masks of at least some of the m mask distributions begin at a same offset from the first residue position.
18. The computer-implemented method of claim 15, wherein the compressed representation has m rows and r columns.
19. The computer-implemented method of claim 15, wherein each of the k updated representation tiles has m rows and k columns, wherein a given column in the k columns of a given updated representation tile contains a respective subset of the updated representation features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column.
20. The computer-implemented method of claim 15, wherein each of the k embedding tiles has m rows and k columns, wherein a given column in the k columns of a given embedding tile contains a respective subset of the embedding features, wherein the respective subset is located at a given ordinal position in the k ordinal positions, and wherein the given ordinal position is represented by the given column.